I. INTRODUCTION


During the past decade, several studies have been made of the 
permanence of vocational interests as measured by modern inventory 
methods. Strong¹² compared the vocational interests of college seniors
with the vocational interests of the same men measured five years 
later. Burnham¹ and Van Dusen¹⁴ compared the interests of college
freshmen with their interests as college seniors. All three investiga- 
tions indicate that the constancy of vocational interests as measured 
by means of the Strong Vocational Interest Blank for Men is greater 
than one might expect in view of the literature on older methods of 
measuring interests.® 

In the present study, a comparison is made of the vocational 
interests of a group of high-school sophomore boys with their interests 
as high-school seniors. Again the Strong Vocational Interest Blank 
for Men has been used. This appears to be an appropriate extension 
of the series of studies, although one may logically expect to find the 
instrument less reliable at this lower age level. Interest in clarifying 
the effects of maturation and of training upon the constancy of 
interest test scores for various samples makes such comparative 
study desirable. Moreover, the needs of counselors and vocational 
advisers suggest the desirability of investigating the uses and limita- 
tions of this instrument for high-school samples. 





* The writers wish to acknowledge their indebtedness to Dr. Herbert S. Conrad,
who read and criticized the manuscript.
II. THE DATA


As a part of a seriatim study at the University of California 
Institute of Child Welfare, the Strong Vocational Interest Blank for 
Men was administered to a group of sixty-four high-school boys each 
year for three consecutive years. The first administration took place 
in 1936 when the boys were high-school sophomores; the last tests 
were in 1938 when the boys were seniors. The sampling was drawn 
from an Oakland, California, public high school; every effort was made 
to select a sampling representative of the typical male population 
of a modern urban high school. The mean age of the group at the 
time of first testing was 15.7 years. 


TABLE I. PERMANENCE COEFFICIENTS FOR SIXTY-FOUR HIGH-SCHOOL SOPHOMORE
BOYS TESTED AGAIN AFTER TWO YEARS, AND PERMANENCE COEFFICIENTS
FOR TWO HUNDRED TWENTY-THREE COLLEGE SENIORS TESTED AGAIN
AFTER FIVE YEARS. THE DATA FOR COLLEGE MEN ARE FROM
STRONG’S STUDY¹²

                                High-school sampling      Strong’s college men
                                            r with                    r with
Interest scale                  Re-test r   interest      Re-test r   interest
                                            maturity                  maturity
Chemist                            .65        -.11           .84        -.18
Lawyer                             .59         .18       [illegible]     .26
Life insurance salesman            .66         .23           .80         .20
Teacher                            .49         .84           .66         .67
Y.M.C.A. secretary                 .55         .71           .67         .68
Office worker                      .48         .34           .74         .17
Certified public accountant        .57         .33           .59*        .37
Average                        [illegible]                   .73





Note: In the computation of the averages of the r’s, the correlations were
squared, summed, divided by seven, and the square root of the result derived. See 
references 4 and 6. 
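
The averaging rule stated in the note can be illustrated with a short sketch. The seven values below are the high-school re-test coefficients as read from Table I; the function name `rms_average` is introduced here only for illustration and does not appear in the original study.

```python
import math

# High-school re-test correlations for the seven scales, as read from Table I.
high_school_retest_r = [0.65, 0.59, 0.66, 0.49, 0.55, 0.48, 0.57]

def rms_average(correlations):
    """Square each r, sum the squares, divide by the count, take the square root."""
    mean_square = sum(r * r for r in correlations) / len(correlations)
    return math.sqrt(mean_square)

average_r = rms_average(high_school_retest_r)
print(f"{average_r:.2f}")
```

Averaging the squares in this way weights the larger coefficients somewhat more heavily than a simple arithmetic mean would.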

*The earlier form of the C.P.A. scale was used in Strong’s study¹² of
permanence. 


Each boy was asked to state a vocational choice at each testing. 
The relation of expressed vocational choice to the interest rating 
provided by the test has been reported upon elsewhere.? 
In the present study, all interest blanks were scored using seven
occupational scales which, according to Strong,®: are representative 
of various fields of interest. The seven scales are: Chemist, lawyer, 
life insurance salesman, teacher, Y.M.C.A. secretary, office worker, 
and certified public accountant. In addition, each blank was scored 
using Strong’s interest maturity scale, Carter’s masculinity-femininity
scale,* the Young-Estabrooks Studiousness Scale, and the single
occupational scale most appropriate to the expressed vocational choice 
of the individual student. 

Numerical scores, as well as letter ratings, were secured. The 
former were converted into standard scores for use in statistical 
analyses; the latter have been used for comparison of scores of high- 
school boys with adult norms. 
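
The conversion of numerical scores into standard scores may be sketched as follows. This is a minimal illustration only: the raw scores are hypothetical, and the use of the group mean and the population standard deviation (z-score form) is an assumption, since the text does not state the exact form of standard score employed.

```python
from statistics import mean, pstdev

def to_standard_scores(raw_scores):
    """Express each raw score as a deviation from the group mean in SD units."""
    group_mean = mean(raw_scores)
    group_sd = pstdev(raw_scores)  # assumption: population SD of the tested group
    return [(score - group_mean) / group_sd for score in raw_scores]

# Hypothetical raw scores on one occupational scale (not data from the study).
hypothetical_raw = [-40, -10, 0, 15, 35]
standard_scores = to_standard_scores(hypothetical_raw)
print([round(z, 2) for z in standard_scores])
```

Scores expressed in this form are directly comparable from scale to scale within the group, which is what the correlational analyses require.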


III. RESULTS 


As a measure of the relative constancy or permanence of vocational 
interests, Pearson product-moment coefficients of correlation were cal- 
culated between scores from the first and third administrations of the 
blank. The coefficients are shown in Table I. Permanence coefficients
for the same scales have been reported by Strong¹² as a part of
his study of the permanence of interests of college seniors; they are 
shown in the same table for comparison. Since maturity of interest 
may influence scores, coefficients indicating the relation of scores on 
the Interest Maturity Scale to the scores obtained on each of the 
vocational scales are shown in Table I. Similar data reported by 
Strong¹² are also given. None of the coefficients of correlation given
in Table I has been corrected for attenuation. The results, therefore, 
indicate the permanence of interests as actually measured by the 
interest inventory in its present form, and not the permanence which 
might be registered through use of a more reliable instrument. 

Test-retest coefficients of correlation vary from .48 to .66 for the
high-school sample. Test-retest coefficients reported by Strong¹²
vary from .59 to .84. For the high-school boys, the lowest coefficients 
obtained were those for the teacher and office worker scales. As indi- 
cated in Table I, these scales measure interests which correlate most 
markedly with interest maturity. 





* Carter’s masculinity-femininity scale was based upon data from high school
boys and girls. The correlation with Strong’s published m-f scale is high, but the 
item weights are somewhat different inasmuch as Strong used large groups of adult 
men and women in developing his scale. 
When the rank order of permanence coefficients is correlated with
the rank order of interest maturity coefficients using the rank differ- 
ence method, a coefficient of —.79 is obtained for the high-school 
sampling and a coefficient of —.75 is obtained for the college men. 
Strong¹² has suggested the hypothesis that “if the varying effects of
interest maturity could be deducted from the various occupational 
scores, there would result correlations of permanence that would 
approximate each other for all the occupational scales; furthermore, 
that many such correlations would be considerably higher than those 
[remainder of quotation illegible in source]”
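
The rank-difference coefficient for the high-school sampling can be checked from the two high-school columns of Table I. The sketch below uses the usual rank-difference formula, rho = 1 - 6(sum of d²)/(n(n² - 1)); the helper names are introduced for illustration, and the values are as read from the table in the order chemist, lawyer, life insurance salesman, teacher, Y.M.C.A. secretary, office worker, certified public accountant.

```python
def ranks_descending(values):
    """Rank 1 = largest value; assumes no ties (true for the data below)."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

def rank_difference_rho(x, y):
    """Rank-difference (Spearman) correlation for untied data."""
    rank_x, rank_y = ranks_descending(x), ranks_descending(y)
    n = len(x)
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d_squared / (n * (n * n - 1))

# High-school columns of Table I.
retest_r = [0.65, 0.59, 0.66, 0.49, 0.55, 0.48, 0.57]
r_with_interest_maturity = [-0.11, 0.18, 0.23, 0.84, 0.71, 0.34, 0.33]

rho = rank_difference_rho(retest_r, r_with_interest_maturity)
print(round(rho, 2))  # -0.79
```

The squared rank differences sum to 100, which gives the coefficient of -.79 reported for the high-school sampling.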

Study of Table I would lead to the conclusion that the vocational
interests of high-school sophomore boys are considerably less constant 
than the vocational interests of college seniors. It appears reasonable 
to attribute the lesser permanence of interests of young people in 
part to the effects of maturing, and in part to slightly lesser reliability 
of the interest inventory when applied to youthful subjects. 

Correlation of the permanence coefficients reported by Strong¹²
with those obtained for the high-school sampling by the rank differ- 
ence method yielded a rho of .71. This figure may be considered an 
indication of general similarity in the rank order of the permanence 
coefficients of the two groups studied. Such agreement may very 
well be due in part to the greater reliability of some of the scales, and 
in part to the lesser effect of maturational changes upon scores for 
certain types of interest. 

A comparison of changes in ratings for each scale provides addi- 
tional information about permanence of interests. Five letter ratings 
are assigned to vocational interest scores for each scale of the Strong 
Vocational Interest Blank. The critical scores which separate these 
ratings lie at points between one quartile and 3.5 quartile deviations 
below the mean for the criterion group. A letter rating of “A” on
a vocational scale indicates the presence of interests similar to those of
persons successfully engaged in the occupation (the criterion group).
Ratings of “B+,” “B,” and “B-” indicate lesser amounts of such
interests. “C” ratings indicate apparent absence of the interests of
successful workers in the occupation. 

Evidence concerning the changes in ratings, and the direction of 
these changes, is presented in Table II. In order to permit comparisons,
the same method as employed by Strong¹² has been used in
compiling Table II. Each rating has been considered as equivalent to 
one step on a continuum; consequently, a change in rating from 
“C” to “B-” would be equivalent to a change of one step in an
upward direction. A change in rating from “B-” to “A” would be
the equivalent of a change of three steps (“B,” “B+,” “A”) in an
upward direction. A change of rating from “B-” to “C” would be
a change of one step in a downward direction. Table II presents the 
average changes in terms of steps upwards and downwards for each of 
the seven vocations. The net change in terms of the steps is shown
in column five. The percentage of change is shown in column six.
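
The step method just described can be sketched in a few lines. The test-retest pairs below are hypothetical, and it is assumed (not stated in the text) that the average upward and downward changes are both taken over all cases, so that the net change is their difference.

```python
# Letter ratings ordered from lowest to highest; adjacent ratings are one step apart.
RATING_ORDER = ["C", "B-", "B", "B+", "A"]

def step_change(first_rating, second_rating):
    """Signed number of steps between two ratings (positive = upward)."""
    return RATING_ORDER.index(second_rating) - RATING_ORDER.index(first_rating)

def summarize_changes(pairs):
    """Average upward, downward, and net change per case, plus per cent changed."""
    steps = [step_change(first, second) for first, second in pairs]
    n = len(steps)
    upward = sum(s for s in steps if s > 0) / n
    downward = sum(-s for s in steps if s < 0) / n
    changed = 100 * sum(1 for s in steps if s != 0) / n
    return {"upward": upward, "downward": downward,
            "net": upward - downward, "per_cent_changed": changed}

# Hypothetical (first test, retest) ratings for five boys on one scale.
hypothetical_pairs = [("C", "B-"), ("B-", "A"), ("B-", "C"), ("B", "B"), ("A", "A")]
summary = summarize_changes(hypothetical_pairs)
print(summary)
```

The first three hypothetical pairs reproduce the three examples given in the text: one step upward, three steps upward, and one step downward.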


TABLE II. COMPARISON OF AVERAGE CHANGES IN INTEREST RATINGS BETWEEN
TWO HUNDRED TWENTY-THREE COLLEGE SENIORS RE-TESTED BY STRONG¹²
AFTER FIVE YEARS, AND SIXTY-FOUR HIGH-SCHOOL SOPHOMORES
RE-TESTED AFTER TWO YEARS

                                 College seniors         Sampling of high-school boys
                               Average     Average     Average   Average   Average   Per cent
Interest scale                 upward      downward    upward    downward  net       change
                               change      change      change    change    change
Chemist                      [illegible] [illegible]     .42       .27       .15       42
Lawyer                           .45         .21         .25       .09       .16       25
Life insurance salesman          .28         .18         .28       .16       .12       28
Teacher                          .78         .10         .48       .45       .03       50
Y.M.C.A. secretary               .19         .12         .23       .39      -.16       36
Office worker                [illegible]     .27         .50       .72      -.22       61
Certified public accountant      .57         .05         .14       .05       .09        9
Average                          .44         .16         .328      .304
Net change                       .29                                         .024











Table II shows that the interest test ratings of high-school students 
shift about (disregarding direction of change) approximately to the 
same extent as do the ratings of college seniors, but there are impor- 
tant differences in the pattern of changes. The high-school boys show 
a greater amount of downward change (e.g. from ratings of A to ratings 
of B, etc.) than is shown by college men. Considering the various
interest scales separately, in no instance do the average ratings of
college men show more downward changes than upward changes. 
But in two instances (Y.M.C.A. secretary and office worker) the 
changes of ratings of high-school boys are more often negative than 
positive. This is not what would be expected if the course of development
were marked by a steady increase in scores as boys grow older.
The point is that the diagnostic value of the test is probably better 
manifested as the subjects increase in age, and the representative group 
of high-school boys includes many who do not really have well- 
developed patterns of interest. There is more random change com- 
plicating the results with high-school boys; in the case of college men, 
many more have well-established interests which increase more often 
than they disappear. The direction of changes shown in the case of 
high-school boys is often such as to suggest a trend not in conformity 
with the practical demands of the occupational world. Thus, as 
they grow older fewer have the interests of office workers (who are in 
demand in large numbers) and more have the interests of chemists 
and lawyers (crowded professions for which the average boy is 
unsuited). 

The percentages of cases in which the ratings change, shown in the 
last column of Table II, indicate that in the majority of cases the 
ratings are identical after two years. This is evidence that the test 
measures interests which have a considerable amount of stability even 
among high-school boys. This is true even though the interests are 
less stable than among older persons. The average amount of change 
is about a third of a letter-grade step; the letter grades so indicated 
cannot be regarded as equal steps, but there are five of them and the 
breadth of the categories can be inferred from the per cents falling in 
each category (see Table III). The data are sufficient to show that 
the average amount of change is not great. 

It will be observed that the high-school group show greater changes 
on the office worker and teacher scales than on any other scales. 
Permanence coefficients for these scales are .48 and .49, respectively 
(Table I). The percentages of change on these two scales are sixty- 
one per cent and fifty per cent, respectively. In the case of the 
teacher scale, the net change is misleading if taken alone. The size 
of the upward and downward movement indicates the instability of 
these interests evidenced by high-school students. 

The smallest amount of change and the greatest stability of rating 
are shown for the C.P.A. scale. Although the college men changed 
markedly on this scale, the high-school group show a constancy of 
interest of ninety-one per cent. This means that ninety-one per cent 
of the ratings on the C.P.A. scale remained the same over the two-year 
period. The lawyer and life insurance salesman scales show stability 
of interest amounting to seventy-five per cent and seventy-two per
cent. 

Comparison of the average changes in each direction shows marked 
differences between the high-school and college samplings. There is a 
greater upward change in ratings for the college men than for the 
high-school boys. There is a much greater change downward for the 
high-school boys than for the college men. As a result of these dif- 
ferences, the college group indicate an average change in rating of 
.288 steps upward over a period of five years, whereas the high-school 
group indicate an average change of .024 steps upward over the two- 
year period. The marked changes in the high-school group’s ratings 
have cancelled each other, leaving a small average change. The 
results suggest that the important differences between the interests of 
high-school sophomores and of college seniors are not merely age 
differences, but also differences due to selection of sampling. 
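
The cancelling of upward and downward changes in the high-school group can be verified from the high-school columns of Table II. The sketch below takes the upward and downward averages as read from the table and recomputes the net change for each scale and for the seven scales together; the variable names are introduced for illustration.

```python
# (average upward change, average downward change) for the high-school boys, Table II.
high_school_changes = {
    "Chemist": (0.42, 0.27),
    "Lawyer": (0.25, 0.09),
    "Life insurance salesman": (0.28, 0.16),
    "Teacher": (0.48, 0.45),
    "Y.M.C.A. secretary": (0.23, 0.39),
    "Office worker": (0.50, 0.72),
    "Certified public accountant": (0.14, 0.05),
}

# Net change per scale is the upward average minus the downward average.
net_by_scale = {scale: up - down for scale, (up, down) in high_school_changes.items()}

n_scales = len(high_school_changes)
average_up = sum(up for up, _ in high_school_changes.values()) / n_scales
average_down = sum(down for _, down in high_school_changes.values()) / n_scales
average_net = average_up - average_down

for scale, net in net_by_scale.items():
    print(f"{scale:30s}{net:+.2f}")
print(f"average up {average_up:.3f}, down {average_down:.3f}, net {average_net:.3f}")
```

The recomputed averages agree, to within rounding, with the tabled figures of .328, .304, and .024: sizable movements in both directions leave almost no net change.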


TABLE III. DISTRIBUTION OF RATINGS ON SEVEN OCCUPATIONAL INTEREST
SCALES FOR SIXTY-FOUR HIGH-SCHOOL SOPHOMORE BOYS TESTED IN 1936,
AND RE-TESTED IN 1938

Ratings      Per cent   Fractions of those receiving a given rating in 1936 who
obtained     ratings    received the same or another rating in 1938
in 1936                    C        B-       B        B+       A
A              5.6         .45      .22      .89     1.34     2.67
B+             8.5        2.23     1.12     1.56     2.23     1.33
B             12.7        4.91     1.56     2.67     1.33      .45
B-             6.9        2.90     1.12     1.12     1.33      .45
C             66.3       55.36     3.79     5.13     1.33      .67
Totals       100.0       65.85     7.81    11.37     7.56     5.57























A further indication of the constancy and permanence of interests 
is provided in Table III. The distribution of ratings received in 1936 
is indicated in a single column of this table, in percentage form. The 
further distribution of these ratings, according to the ratings received 
in 1938, is indicated in the main body of the table. That is, the percentages
shown in columns 3, 4, 5, 6, and 7 are the percentages of those
students who received a given rating in 1936 and who received the 
same or another rating in 1938. For purposes of comparison, similar 
data reported by Strong¹² are given in Table IV.
For the high-school group, sixty-four per cent of the ratings are
identical in test and retest, and seventy-nine per cent of the ratings 
agree within one rating or letter grade. Strong summarized his data 
by stating that sixty-three and three-tenths per cent of the ratings 
were identical in test and retest, while eighty-four and six-tenths per 
cent of the ratings agreed within one rating or letter grade. It will be 
noted that the percentage of identical ratings is approximately the 
same for both groups, but that the dispersion of related ratings is wider 
for the high-school group. 
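
The percentages of identical ratings, and of ratings agreeing within one letter grade, follow directly from the bodies of Tables III and IV. The sketch below sums the appropriate cells. The cell values are as read from the two tables; a few Table IV entries are poorly legible in the source and are taken here at the values implied by the row and column totals, which is an assumption.

```python
# Column order of the retest ratings, lowest to highest.
ORDER = ["C", "B-", "B", "B+", "A"]

# Each row: first-test rating -> per cent of all ratings in each retest category.
TABLE_III = {  # high-school boys, 1936 to 1938
    "A":  [0.45, 0.22, 0.89, 1.34, 2.67],
    "B+": [2.23, 1.12, 1.56, 2.23, 1.33],
    "B":  [4.91, 1.56, 2.67, 1.33, 0.45],
    "B-": [2.90, 1.12, 1.12, 1.33, 0.45],
    "C":  [55.36, 3.79, 5.13, 1.33, 0.67],
}
TABLE_IV = {  # Strong's college seniors, 1927 to 1932 (some cells reconstructed)
    "A":  [0.03, 0.06, 0.3, 1.2, 2.7],
    "B+": [0.5, 0.7, 1.9, 2.6, 2.3],
    "B":  [2.2, 1.6, 3.7, 2.7, 1.3],
    "B-": [2.8, 2.1, 2.8, 1.2, 0.3],
    "C":  [52.2, 6.0, 6.2, 2.0, 0.6],
}

def agreement(table, max_steps):
    """Per cent of ratings whose retest lies within max_steps of the first rating."""
    total = 0.0
    for first_rating, row in table.items():
        i = ORDER.index(first_rating)
        total += sum(pct for j, pct in enumerate(row) if abs(i - j) <= max_steps)
    return total

for name, table in [("High school", TABLE_III), ("College", TABLE_IV)]:
    print(f"{name}: identical {agreement(table, 0):.1f}, "
          f"within one step {agreement(table, 1):.1f}")
```

Summing the diagonal cells gives the percentage of identical ratings; adding the cells immediately adjacent to the diagonal gives the percentage agreeing within one letter grade.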

A comparison of the two tables indicates that the high-school 
sampling shows greater percentages of “C” ratings in 1938 for each
of the 1936 test ratings and smaller percentages of “B+” and “B”
ratings in 1938 for each of the 1936 test ratings. Of the five and
six-tenths per cent of “A” ratings received in 1936, 2.67 per cent
remained “A” ratings in 1938, etc. Thus slightly over half the
persons who received A ratings in 1936 received lower ratings in 1938. 

TABLE IV. DISTRIBUTION OF RATINGS ON SIXTEEN OCCUPATIONAL INTEREST
SCALES FOR TWO HUNDRED TWENTY-THREE COLLEGE MEN TESTED IN 1927
AS SENIORS AND RE-TESTED IN 1932. DATA FROM E. K. STRONG’S STUDY¹²

Ratings      Per cent   Fractions of those receiving a given rating in 1927 who
obtained     ratings    received the same or another rating in 1932
in 1927                    C        B-       B        B+       A
A              4.2        0.03     0.06     0.3      1.2      2.7
B+             8.0        0.5      0.7      1.9      2.6      2.3
B             11.6        2.2      1.6      3.7      2.7      1.3
B-             9.2        2.8      2.1      2.8      1.2      0.3
C             67.0       52.2      6.0      6.2      2.0      0.6
Totals       100.0       57.7     10.5     14.9      9.7      7.2











It will be noted that the group of high-school sophomores received
a smaller percentage of “C” ratings in 1936 than did the college
seniors at the time of their first test. At the end of two years, the
high-school group received approximately the same percentage of “C”
ratings; the college group, at the end of five years, received a lesser
percentage of “C” ratings than they received at the first testing.
In 1936, and again in 1938, the high-school sampling received slightly
more “A” ratings than did the college group on their original test
or retest.
The data in Tables III and IV show that the high ratings of high-school
sophomores regress toward the mean more than do the high
ratings received by college seniors. This is true even though the
interval between tests was five years for the college seniors and two
years for the high-school sophomores. In making this comparison, it
is assumed that the mean for the population is within the range covered
by “C” ratings, since in both samples the “C” ratings included more
than fifty per cent of the groups tested. However, a main finding is
that the distribution of ratings on both test and retest is very similar
for high-school boys and for college seniors. Also, there seems to be a
tendency for a larger percentage of “A” ratings to occur upon the
second testing than upon the first testing.

A somewhat different treatment of the data, shown in Table V, 
enables comparison of the results of four investigations, the present 
study and three others.¹,¹²,¹⁴

TABLE V. PERCENTAGES, INDICATING DISTRIBUTION OF RATINGS ON FIRST TESTS
AND RE-DISTRIBUTION OF THE RATINGS UPON RE-TEST: A COMPARISON OF
FOUR INVESTIGATIONS

                     Distribution of    Percentages of those receiving a given rating on
                     ratings on first   the first test, who received each indicated rating
                     test               on the second test
                     Per cent              A         B         C
A Ratings
  Strong’s study         4                63        36         1
  Burnham                5                35        57         8
  Van Dusen              8                47        50         3
  Present study          6                48        44         8
B Ratings
  Strong’s study        29                14        67        19
  Burnham               33                 9        62        29
  Van Dusen             39                10        56        35
  Present study         28                13        51        36
C Ratings
  Strong’s study        67                 1        21        78
  Burnham           [illegible]            1        20        79
  Van Dusen             53                 2        18        80
  Present study         66                 1        16        83

With reference to the high-school sampling, it appears from Table V
that a boy who received an “A” rating at the time of first testing has a
forty-eight per cent chance of receiving the same rating two years
later. If he received a “C” rating, there is an eighty-three per cent
chance that he will obtain the same rating at the end of two years and
there is only a one per cent chance that he will receive an “A” rating.
After two years, a retest shows that an “A” rating will be replaced
by a rating of “B” or higher in eighty-eight per cent of the cases, and
that a “C” rating will be replaced by a rating of “B” or lower in
ninety-seven per cent of the cases.
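
The chances quoted for the high-school sampling can be recovered from the “A” and “C” rows of Table III by dividing the relevant cells by the row sum. The sketch below does this; the function name `chance` is introduced for illustration, and the cell values are as read from the table.

```python
# Column order of the 1938 ratings in Table III, lowest to highest.
ORDER = ["C", "B-", "B", "B+", "A"]

# Rows of Table III for boys rated "A" and "C" in 1936 (per cent of all ratings).
row_a = [0.45, 0.22, 0.89, 1.34, 2.67]
row_c = [55.36, 3.79, 5.13, 1.33, 0.67]

def chance(row, retest_ratings):
    """Per cent of a first-test row that falls in the given retest ratings."""
    selected = sum(row[ORDER.index(rating)] for rating in retest_ratings)
    return 100 * selected / sum(row)

a_stays_a = chance(row_a, ["A"])
a_b_or_higher = chance(row_a, ["B", "B+", "A"])
c_stays_c = chance(row_c, ["C"])
c_becomes_a = chance(row_c, ["A"])
c_b_or_lower = chance(row_c, ["C", "B-", "B"])

print(f"A stays A: {a_stays_a:.1f}")
print(f"A becomes B or higher: {a_b_or_higher:.1f}")
print(f"C stays C: {c_stays_c:.1f}")
print(f"C becomes A: {c_becomes_a:.1f}")
print(f"C becomes B or lower: {c_b_or_lower:.1f}")
```

It may be noted that “B or higher” and “B or lower” are reckoned here on the five-letter scale of Table III, counting the unchanged rating itself, rather than on the three broad groups of Table V.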

With reference to Strong’s data, there is a sixty-three per cent 
chance that a college senior man who received an “A” rating will
receive the same rating five years later; if he received a “C” rating,
there is a seventy-eight per cent chance that he will obtain the same
rating five years later, and only a .9 per cent chance that he will
receive an “A” rating. After five years, retests showed that “A”
ratings were replaced by ratings of “B” or higher in ninety-eight
per cent of the cases, and that “C” ratings were replaced by ratings of
“B” or lower in ninety-six per cent of the cases.

The percentages of “A,” “B,” and “C” ratings received by the
various groups at the time when first tested are markedly similar.
The distributions of “A,” “B,” and “C” ratings received upon
second test, as indicated in the last three columns in the table, show
considerable fluctuations, but indicate in the main that the constancy
of interest test ratings of “A” and “B” was greater for Strong’s
somewhat older group than for any of the younger groups. The
differences, however, are not extreme. Moreover, the results indicate
that the constancy of interest scores of the high-school boys studied in
the present investigation is very like the constancy of interest scores of
college undergraduates studied by Burnham¹ and by Van Dusen.¹⁴

This comparative study does not indicate that marked develop- 
mental changes in interest scores occur during the period between 
entrance into senior high school and graduation from college. The 
comparisons suggest, rather, that the development is gradual and not 
marked by any sudden or great changes. The comparative data 
might support the hypothesis that more changes in interests occur 
during the school period than occur after graduation from college. 
It is also possible that Strong’s group are specially selected in some 
way (e.g. are largely in some particularly broad occupational category 
characterized by relatively greater stability of interests). 
Strong’s group was composed of two hundred twenty-three men
who were seniors in college at the time of the first testing. They were 
retested each year thereafter for five years. Strong’s data given in 
Table V are based on a comparison of the ratings received five years 
after the original test with the ratings received in college. Both 
Burnham¹ and Van Dusen¹⁴ administered the vocational interest
blank to a group of college freshmen and compared the ratings received 
with the ratings of the same groups three years later. In age and 
experience, obviously the two latter groups are more like the high- 
school group than is Strong’s sampling. 

Although fewer “C” ratings were received by the high-school
sampling on their first test than were received by Strong’s college
senior group, a considerably higher percentage of the “C” ratings
remained constant over the two-year period between tests of the
high-school group. Of the sixty-six per cent of “C” ratings in 1936,
eighty-three per cent of that number were “C” ratings in 1938. The
high-school sampling would appear to be more stable than college men 
on this rating. It is, apparently, the most stable of the five ratings for 
a high-school sampling.* Absence of vocational interest in a field 
would seem to be more constant than the presence of vocational 
interests. Knowledge of this fact may be useful to the high-school 
counselor. 


IV. SUMMARY AND CONCLUSIONS 


The Strong Vocational Interest Blank for Men was administered to
sixty-four high-school boys each year for three consecutive years. 
The mean age of the group was 15.7 years at the time of first testing. 
An interval of one year was maintained between tests. 

All blanks were scored using seven scales regarded by Strong?:’® as 
representative of seven families of occupations. The blanks were also 
scored using scales for interest maturity, masculinity of interests, and 
studiousness. Scores on each scale were converted into letter ratings 
based upon adult norms and standard scores based upon high-school 
group means and standard deviations. 

Letter ratings and standard scores for the two-year period were 
analyzed for evidence of the permanence of vocational interests. 





* This finding can have only descriptive significance, in view of the lack of 
proof or even likelihood that the letter ratings represent equally broad categories. 
The width of the categories is a function of the sampling; hence the finding describes 
an empirical fact. 
Test-retest correlations for each scale with an interval of two years
were calculated. Average changes in rating in both upward and down- 
ward directions were computed for each scale, enabling comparisons of 
change on individual scales. Comparison was made of the percentages 
of ratings received at the first testing with the percentage of identical 
and other ratings received two years later. Finally, the constancy of 
each of the five letter ratings for high-school sophomore boys was com- 
pared with the constancy of the letter ratings for two groups of college 
freshmen and one group of college seniors reported upon by three other 
investigators.¹,¹²,¹⁴

A comparison of the results of this study with data reported by 
Strong,¹² Burnham,¹ and Van Dusen¹⁴ shows that:

(1) The vocational interest scores of high-school boys are less stable 
than are the vocational interest scores of recent college graduates. 
Retest correlations vary from .48 to .66 for high-school boys tested 
with an interval of two years intervening; similar coefficients reported 
by Strong¹² vary from .59 to .84 for college seniors retested five years
later. 

(2) For the high-school group, the greatest changes of scores and 
the least stability of interests are shown for the office worker, teacher, 
and Y.M.C.A. secretary scales; these are the scales which correlate 
most with interest maturity. The least change and the greatest 
stability are shown for ratings on the C.P.A., lawyer, and life insurance 
salesman scales. 

(3) The rank order of permanence coefficients among the occu- 
pational scales used is similar for high-school boys and for college 
men. A rank-difference correlation of .71 indicates the extent of this 
resemblance. 

(4) Scores of the high-school boys in this sampling increased less 
often in a two-year interval, and decreased more often than did scores 
of college seniors retested by Strong after five years. In both test and 
retest the high-school group received a slightly higher percentage of
“A” ratings than did Strong’s college senior group.

(5) The percentage (sixty-four per cent) of identical letter ratings 
received by the high-school boys retested after two years is approxi- 
mately the same as the percentage (63.3 per cent) of identical ratings 
received by Strong’s college seniors retested after a five-year period. 
The similarity of the two groups in the percentage of identical ratings 
in two testings appears somewhat dependent upon the greater stability
of the “C” ratings for the high-school sophomores.
(6) Although the constancy of vocational interests of these high-school
boys was less than that of Strong’s college senior men, comparison
with studies by Burnham¹ and Van Dusen¹⁴ indicates that the
interests of these high-school boys were about as constant as the 
interests of college men tested as freshmen and retested as seniors. 

(7) The Strong Vocational Interest Blank for Men indicates certain 
facts about the interests of high-school boys with considerable reliability.
Thus, if a high-school sophomore boy received a “C” rating on
the first test, there is an eighty-three per cent chance that he will
receive the same rating and only a one per cent chance that he will
receive an “A” rating two years later. And if he received an “A”
rating on the first test, there is an eighty-eight per cent chance that
he will receive a rating of “B” or higher two years later.

(8) In view of the fact that the level of interest test scores of the 
high-school boys did not rise appreciably in the two years between 
tests, it appears that in one respect at least this representative high- 
school group is not becoming more like Strong’s college senior group as 
time goes on. In the attempt to explain this finding, it may be stated 
as an hypothesis that differences often believed to exist between the 
interests of older men, college men, and high-school groups may not 
be due to age changes, but where so regarded may often be artifacts 
resulting from the use of the cross-sectional method, involving the 
comparison of essentially different samplings. 
AN EXPERIMENTAL ANALYSIS OF THE THEORY OF
INDEPENDENT ABILITIES* 


ROBERT S. MORROW
Brooklyn, New York 


The current theories of abilities appear to favor a theory of inde- 
pendent functions. Thurstone,’ whose theory of primary abilities 
has superseded the Thorndike” multiple factors, subscribes to a theory 
of independence among the primary group factors. Kelley’ assumes a 
position which is close to that of Thurstone by conceiving mental traits 
as constellations of abilities and related phenomena capable of independent
functioning. The theory of “unique traits” posited by the
psychologists at the University of Minnesota’ agrees quite well with 
the Thurstone primary abilities. Thomson’s sampling theory con- 
siders the mind as made up of an infinite number of neural bonds which 
may exist independently, or as “subpools” of the various bonds.
Tryon” advocates a similar point of view with the gene as the basis. 

In contrast to these theories of independent abilities stands the 
Spearman two-factor theory.¹⁴ Alexander¹ advocates dynamic interrelationship
among the different abilities in a modification of Spearman’s
theory. He insists, however, that the interrelationships are
the resultant of more than one factor. Garrett‘ analyzed the group 
factors found in several investigations and reported that in all cases 
these group factors are interrelated instead of existing independently 
as had been thought. 


THE PROBLEM 


The purpose of the present study, therefore, is to determine whether 
human abilities as measured by special tests are independent or inter- 
dependent and the extent of such relationship. More specifically, 
the problem is to find by means of correlational analysis and the fac- 
torial analysis technique the degrees of relationship among certain 
tests of intelligence, musical ability, artistic judgment, clerical ability, 
mechanical ability, and manipulative ability. 





*Submitted in partial fulfillment for the Ph.D. degree at the New York 
University Graduate School of Arts and Science, June 1940. The writer is pro- 
foundly grateful to Associate Professor T. N. Jenkins for supervising the research 
and to Professor P. D. Stout for making available all the facilities of the Depart- 
ment of Psychology of New York University, Washington Square College of Arts 
and Science. A revised version of this paper was read at the September, 1940, 
meeting of the American Psychological Association. 

THE EXPERIMENT


Subjects.—Eighty male students from the undergraduate course in
general psychology at Brooklyn College and at New York University 
were given the complete battery of tests used in this study. The age 
range of these subjects was from fifteen years, nine months to twenty- 
four years, two months, with the median age at eighteen years, four 
months. The socio-economic status of the vast majority of the sub- 
jects was essentially the same, since most of them came from the upper 
working class and lower middle class type of home environment. All 
were born in the United States. They usually had one or both parents 
who were born in this country, and they were mainly of the Hebrew 
cultural background. 

Procedure.—Eight presumably different tests of ability were given 
in a random manner to the subjects. Because many of the tests could
be administered to groups as well as individually, and because a
conscious effort was made to avoid adherence to a particular sequence
of administration, the testing procedure differed considerably from
subject to subject. Whether the tests were taken individually or in groups
depended upon conditions, such as the number of subjects available 
at a particular time. The groups never exceeded six, and were usually 
two or three in number. Some subjects took all the pencil-paper tests 
in group form; others took some of these tests in group and individual 
form; while most took all the tests individually. 

The testing program was approximately six hours long. Depend- 
ing upon the subjects’ convenience, the program was administered in 
three sessions, two sessions, or six sessions. The procedure most often 
followed, however, was a two-hour session, repeated three times. 
The pattern of administration of the test battery depended upon fac- 
tors which helped create maximum motivation or facilitated the testing 
program. For instance, if the interest of the subject could be main- 
tained by withholding the music or intelligence test until the end, this 
procedure would be used. In many cases it became necessary to 
administer these tests first in order to arouse the curiosity and inter- 
est of the subject. Most of the so-called mechanical ability tests 
were individual tests and were administered mainly at the end of the
procedure.

Tests.—Many factors entered into the selection of the specific
testing materials used in this experiment. Among the outstanding
bases for selection of the material were the extent to which the tests
were considered to be distinct and separate measures of the specific
abilities, the validity and reliability of the tests, the length and diffi- 
culty of administration of the tests, the reputation which these tests 
had among various psychologists,* the use of the tests in previous 
research, and the personal experiences of the experimenter with these 
and other tests. 

The test of intelligence used was the 1938 edition of the American 
Council on Education Psychological Examination for College Freshmen
by Thurstone and Thurstone.*! This examination consists of six separate
tests and yields two subscores and a general score. The Q-score comprises the
three allegedly quantitative tests—the arithmetic, analogies and num- 
ber series tests; the L-score is the total of the three linguistic tests— 
the completion, artificial language and same-opposite vocabulary tests. 
The general or gross score is the total of the six separate tests. 

The Seashore Measures of Musical Talent’ containing tests of 
pitch, intensity, time, consonance, memory and rhythm were used to 
indicate musical ability. As a test of artistic appreciation the Meier- 
Seashore Art Judgment Test’? was used. The Minnesota Vocational 
Test for Clerical Workers,* containing number checking and name 
checking tests, was used to measure clerical ability. 

Three tests were used to measure the different aspects of mechan- 
ical ability. The Likert and Quasha Revised Minnesota Paper Form 
Board was used to measure the ability to perceive and analyze spatial 
or geometrical patterns in two dimensions. The short form (Boards A 
and B) of the Minnesota Spatial Relations test'!? was used as an 
apparatus test for the measurement of two dimensional spatial rela- 
tions. The short form (Set I and II) of the Minnesota Mechanical 
Assembly Box!? was used to measure mechanical analysis and assem- 
bling ability. 

The O’Connor Finger and Tweezer Dexterity Tests’ were used as 
measures of manipulative ability. The O’Connor manipulative tests 
and the Minnesota Spatial Relations and Assembly Boxes were given 
in the individual form in all instances. 


THE RESULTS 


The Intercorrelations among the Tests.—The correlational analysis
has been almost indispensable as an aid in studying “mental” abilities.





* Pallister'! conducted a study in which thirty-eight applied psychologists 
were asked their opinions on fifty-three well-known tests. Almost all of the tests 
used in the present study received highly “efficient” ratings.
The degree of relationships among the abilities is revealed through this
technique. Hierarchies of relationships indicate the possible presence 
of common or group factors. 

The intercorrelations among all the variables are shown in Table I. 
The coefficients of correlation range from —.275 between the Minne- 
sota Clerical Name Checking and the Minnesota Assembling Tests to 
.944 between the Thurstone L-score and the Thurstone Gross Score. 
However, if the Q-, L- and Gross Scores are eliminated from the table 
of intercorrelations, as seen in Table II, the highest correlation coeffi- 
cient then becomes .625 between the Number Checking and Name 
Checking tests of the Minnesota Clerical. In view of the fact that 
the Q-, L- and Gross Scores have a somewhat dubious character, it 
appeared more feasible to do the computations with them included as 
well as without them. The presence or absence of the Q-, L- and Gross 
Scores seemed to have little effect on the general results. 

It will be observed from Table II that the correlations are mainly 
positive, although rather low, thereby indicating slight degrees of 
interrelationships among the abilities tested. The highest intercorrela- 
tions appear to be among the tests which are components of the same 
ability. There is, however, considerable overlapping throughout. 

The mean of the correlational matrix, that is, the mean of all the 
correlation coefficients in Table II, is .148. The mean of the correlation
coefficients of the six intelligence sub-tests, as seen in Table III,
is .339. The mean of the music sub-tests is .275. The mean of the 
mechanical battery, which is composed of the two spatial tests and the 
assembling box is .336; the mean of the three mechanical and the two 
manipulative tests taken together is .296. The coefficient of correla- 
tion between the number and name checking tests is .625, and the 
coefficient of correlation between finger and tweezer dexterity is .349. 
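
The averaging described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the writer's own computing procedure; the function name and the three coefficients are hypothetical and are not taken from Table II.

```python
from itertools import combinations


def mean_intercorrelation(corr, tests):
    """Mean of the correlations among all distinct pairs of the named tests.

    `corr` maps a frozenset of two test names to their coefficient.
    """
    pairs = [frozenset(pair) for pair in combinations(tests, 2)]
    return sum(corr[pair] for pair in pairs) / len(pairs)


# Hypothetical coefficients for three sub-tests of one ability
# (illustrative values only, not from Table II).
corr = {
    frozenset({"A", "B"}): 0.30,
    frozenset({"A", "C"}): 0.40,
    frozenset({"B", "C"}): 0.20,
}

print(round(mean_intercorrelation(corr, ["A", "B", "C"]), 3))
```

Where an ability is represented by only two tests, as with the clerical or the dexterity pair, the mean reduces to the single coefficient between them.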

That the correlation coefficients are higher for the tests which
measure the same ability than for tests measuring different
abilities may be seen in Table III. In this table are presented the
mean coefficients of correlation of the tests of the same ability taken 
together as well as the mean correlation coefficients for different com- 
binations of relationship. 

Certain hierarchies stand forth prominently in Table III. The
relationships between intelligence and clerical number and name checking,
art judgment and mechanical ability, and mechanical and manipulative
ability are most outstanding. In a similar way, the absence of certain
relationships is manifest. For instance, the manipulative tests
show no relationship with any of the abilities, save mechanical ability,
and, to some extent, art judgment.

TABLE III.—MEAN COEFFICIENTS OF CORRELATION OF TESTS MEASURING ASPECTS
OF SAME ABILITY AND OF DIFFERENT ABILITIES
(Based on the Intercorrelations of Table II)

Factor Analysis of the Correlations.—Factor analysis represents a
highly useful technique for studying organization of abilities. According
to this technique, test performances in a wide variety of situations,
as in the present experiment, can be reduced to a relatively small
number of experimentally determined reference abilities, such as verbal
ability, numerical ability, and the like. It is particularly useful in
cases such as the present where there is considerable overlap among the
correlations. In the factor analysis technique the vast number of
correlations are reduced to a relatively small number of fundamental
abilities.

The Thurstone centroid multiple factorial analysis” was used to
determine the loadings for the four extracted factors in Table IV. The
Thurstone method of factor analysis is applicable to a large number of
variables and factors, and is not dependent upon the absence of group
factors for its application. Moreover, it aims to give a more comprehensive
analysis of the factors involved than tetrad analysis, since the
first factor loadings are supposed to give the same information as tetrad
analysis.
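
The first step of a centroid extraction, as the method is commonly described, can be sketched as follows: each first-factor loading is a column sum of the correlation matrix divided by the square root of the sum of all its entries. The sketch is not the computation carried out in this study. It assumes that the diagonal cells already hold the communalities and that no reflection of signs is needed, and the three loadings are hypothetical values chosen so that the result can be checked.

```python
import math


def first_centroid_loadings(r):
    """First centroid factor: each column sum divided by the square root
    of the sum of every entry in the matrix."""
    n = len(r)
    col_sums = [sum(row[j] for row in r) for j in range(n)]
    total = sum(col_sums)
    return [s / math.sqrt(total) for s in col_sums]


# Hypothetical loadings on a single common factor (not the study's data).
true_loadings = [0.8, 0.6, 0.5]

# Correlations implied by that one factor; the diagonal cells hold the
# communalities (the squared loadings), which is an assumption of this sketch.
r = [[a * b for b in true_loadings] for a in true_loadings]

print([round(x, 3) for x in first_centroid_loadings(r)])
```

When the correlations arise from exactly one common factor, as in this constructed case, the centroid loadings reproduce the loadings that generated the matrix. With real data a residual matrix remains, from which further factors are extracted in the same way.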

The symbol h² stands for the communality of the test, which is
the sum of the squares of the factor loadings for each variable. If
the communality is equal to unity there are no specific factors present.
The unique variance, or what Thurstone calls the “uniqueness,” of a test
equals 1 − h². This indicates the extent to which specific factors
(including the sampling errors) are present. The factor loadings
squared indicate the percentages of variance in each test attributable
to each factor. The sum of the factor loadings squared for each factor
divided by the number of variables (ΣK²/N) indicates the proportion of
the total variance attributable to that factor. Summed over all the
factors, these proportions equal the sum of the communalities (Σh²)
divided by the number of variables.
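
These definitions can be checked against the loadings printed in Table IV. The sketch below is an illustration only; the loadings are transcribed from that table, while the function names are assumptions. Because the table sums squares that were already rounded to three places, results computed from the loadings themselves may differ from the printed values in the third decimal.

```python
# Factor loadings (I, II, III, IV) transcribed from Table IV, keyed by test number.
LOADINGS = {
    1: (.486, .089, .441, -.263),
    2: (.485, .231, .260, .172),
    3: (.566, .426, .159, -.281),
    5: (.445, .188, .280, .072),
    6: (.559, .433, -.011, .252),
    7: (.299, .312, .221, .201),
    10: (.566, -.260, -.333, -.226),
    11: (.438, -.194, -.170, -.208),
    12: (.430, .102, -.096, .049),
    13: (.157, -.180, -.374, .172),
    14: (.583, .194, -.453, -.202),
    15: (.304, .199, -.376, -.190),
    16: (.400, -.174, .186, .185),
    17: (.361, .367, -.208, .119),
    18: (.485, .489, -.171, .161),
    19: (.505, -.204, -.066, .199),
    20: (.429, -.459, .209, .121),
    21: (.190, -.631, .233, .163),
    22: (.199, -.459, .239, -.162),
    23: (.186, -.473, .106, -.184),
}


def communality(loadings):
    """h^2: the sum of the squared factor loadings of one test."""
    return sum(a * a for a in loadings)


def factor_variance(table, k):
    """Sum of squared loadings on factor k divided by the number of tests."""
    return sum(row[k] ** 2 for row in table.values()) / len(table)


h2 = communality(LOADINGS[1])
print("Test 1: h2 =", round(h2, 3), " uniqueness =", round(1 - h2, 3))

proportions = [factor_variance(LOADINGS, k) for k in range(4)]
print("Per factor:", [round(p, 3) for p in proportions])
print("All four factors:", round(sum(proportions), 3))
```

The total over the four factors is the same quantity as the mean communality of the twenty tests, which is why the two bottom rows of Table IV agree in their last column.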

The factor loadings in both tables remain relatively unmodified 
whether or not the Q-, L- and Gross Scores are included. Since these 
variables were determined through previous factor analysis, they are 
consequently omitted in this one. 

Factor I appears to be a general factor. It is present in all the 
variables, although in varying degrees. Factor I accounts for eighteen 
per cent of the total variance of the tests; Factor II for approximately 
eleven per cent variance; Factors III and IV for seven and four per 
cent, respectively. In all, the four factors account for about forty per
cent of the variance. The communalities indicate that from twenty- 
one per cent to sixty-two per cent of the variance of the individual 
variables is accounted for by the four factors. It could, therefore, be 
stated with much certainty that specific factors are present. 

Although all of the variables are positively loaded with Factor I, it 
can be seen that there are large differences in some of the weightings. 
There seems to be no definite hierarchy of high loadings, thereby indi- 
cating that Factor I is a general integrating factor. There are some 
notably low loadings in this factor. These are for the manipulative
tests, including the Mechanical Assembly Box, which involves
work with the hands. The Consonance test and the Same-Opposite
test also reveal low factor loadings. The latter two are extremely
unreliable tests.

TABLE IV.—FACTOR LOADINGS FOR VARIABLES WITH THURSTONE Q- AND L- AND
GROSS SCORES OMITTED
(Four Factors, Determined by Thurstone Centroid Method)
N = 80

                     Factors
Test        I      II     III      IV      I²     II²    III²     IV²      h²
   1      .486    .089    .441   −.263    .236    .008    .194    .069    .507
   2      .485    .231    .260    .172    .235    .053    .068    .030    .386
   3      .566    .426    .159   −.281    .320    .181    .025    .079    .605
   5      .445    .188    .280    .072    .198    .035    .078    .005    .316
   6      .559    .433   −.011    .252    .312    .187    .000    .064    .563
   7      .299    .312    .221    .201    .089    .097    .049    .040    .275
  10      .566   −.260   −.333   −.226    .320    .068    .111    .051    .550
  11      .438   −.194   −.170   −.208    .192    .038    .029    .043    .302
  12      .430    .102   −.096    .049    .185    .010    .009    .002    .206
  13      .157   −.180   −.374    .172    .025    .032    .140    .030    .227
  14      .583    .194   −.453   −.202    .340    .038    .205    .041    .624
  15      .304    .199   −.376   −.190    .092    .040    .141    .036    .309
  16      .400   −.174    .186    .185    .160    .030    .035    .034    .259
  17      .361    .367   −.208    .119    .130    .135    .043    .014    .322
  18      .485    .489   −.171    .161    .235    .239    .029    .026    .529
  19      .505   −.204   −.066    .199    .255    .042    .004    .040    .341
  20      .429   −.459    .209    .121    .184    .211    .044    .015    .454
  21      .190   −.631    .233    .163    .036    .398    .054    .027    .515
  22      .199   −.459    .239   −.162    .040    .211    .057    .026    .334
  23      .186   −.473    .106   −.184    .035    .224    .011    .034    .304
ΣK²                                      3.619   2.277   1.326    .706   7.927
ΣK²/N                                     .181    .114    .066    .035    .396

 1. Thurstone Arithmetic
 2. Thurstone Analogies
 3. Thurstone Number Series
 5. Thurstone Completion
 6. Thurstone Artificial Language
 7. Thurstone Same-Opposite
10. Seashore Pitch
11. Seashore Intensity
12. Seashore Time
13. Seashore Consonance
14. Seashore Tonal Memory
15. Seashore Rhythm
16. Meier-Seashore Art
17. Minnesota No. Checking
18. Minnesota Name Checking
19. Minnesota Paper Form Board (Revised)
20. Minnesota Spatial Relations
21. Minnesota Assembly Box
22. O’Connor Finger Dexterity
23. O’Connor Tweezer Dexterity

FIG. I.—Correlation between Factors I and II. (Variables Q-, L-, and Gross Scores
omitted.)

Factor II has very high negative loadings for all the mechanical and
manipulative variables. The highest positive loadings are for the
Clerical Number and Name Checking tests, and the Thurstone Number
Series, Artificial Language, and Same-Opposite tests. In a centroid
factor analysis negative and positive signs may be changed arbitrarily
without altering the results. Factor II would, therefore, appear to be
a factor related to mechanical and manipulative ability. The previously
indicated hierarchy between Clerical Number and Name Checking
as well as the relationship between these abilities and some of the
intelligence sub-tests appears to be verified. It would seem then that
Factor II is comprised of two subfactors. One combines the group
factors of mechanical and manipulative abilities and the other embraces
to an appreciable extent the intelligence and clerical group factors.
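
The remark that the signs of a factor may be changed without altering the results can be illustrated numerically. In the sketch below, which is an illustration and not part of the original analysis, the loadings on Factors I and II for Name Checking (test 18) and the Assembly Box (test 21) are taken from Table IV; the function names are assumptions, and only two of the four factors are used, so the product shown is a partial contribution to the reproduced correlation rather than the full value.

```python
def reflect(loadings, k):
    """Reverse the sign of every test's loading on factor k."""
    return [[-a if j == k else a for j, a in enumerate(row)] for row in loadings]


def reproduced_correlation(loadings, i, j):
    """Correlation between tests i and j implied by uncorrelated factors:
    the sum of the products of their loadings, factor by factor."""
    return sum(a * b for a, b in zip(loadings[i], loadings[j]))


# Loadings on Factors I and II from Table IV:
# row 0 is Name Checking (test 18), row 1 is the Assembly Box (test 21).
loadings = [[.485, .489], [.190, -.631]]
flipped = reflect(loadings, 1)

print(round(reproduced_correlation(loadings, 0, 1), 3))
print(round(reproduced_correlation(flipped, 0, 1), 3))
```

Both lines print the same value, since reversing a factor changes the sign of both loadings in every product. The squared loadings, and hence the communalities, are likewise unaffected.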













The Theory of Independent Abilities 505 


Factor III apparently has its highest correlations, on the whole,
with the tests of musical ability if the signs are interchanged. There
seems to be a slight correspondence between musical ability and clerical
ability, as well as between intelligence and mechanical-manipulative
ability.

FIG. II.—The relations among Factors I, II, and III. (Axis I considered as being
perpendicular to the surface of the chart.)


Factor IV shows few high loadings. It nevertheless suggests that
the Analogies test may be considered as much verbal as it is
quantitative. Also revealed is the consistent relationship between
the Art Judgment test and the mechanical and manipulative tests.
The relationships, albeit low, between clerical ability and art
judgment, and between clerical and mechanical abilities, are
likewise disclosed.


506 The Journal of Educational Psychology 


The degree of relationship between factors can be determined by
finding the cosines of the angles between the vectors which are the
expressions of the factor loadings. Figure I represents a two-dimensional
relationship between Factors I and II, with complete disregard
as to the remaining factors. The variables from Table IV are plotted
with respect to these two factors. Line P₁ is the average of the positive
coördinates and P₂ the average of the negative coördinates. The
angle (φ) between the two vectors is approximately 76 degrees. The
correlation between Factors I and II, which is the cosine of a 76-degree
angle, is .242.
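
The conversion from angle to correlation can be verified directly. The short sketch below uses only the 76-degree angle reported above; the function name is an assumption.

```python
import math


def factor_correlation(angle_degrees):
    """Correlation between two factor vectors separated by the given angle."""
    return math.cos(math.radians(angle_degrees))


print(round(factor_correlation(76), 3))
```

A right angle gives a cosine of zero, which is the sense in which orthogonal factors are uncorrelated; any smaller angle implies a positive correlation.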

With respect to a three-factor relationship it is necessary to describe 
the variables in a tri-dimensional structure. That is, the factor load- 
ings are plotted on the surface of a sphere. Factor IV appears to be 
relatively insignificant and is, therefore, omitted. Figure II gives a 
bird’s-eye view of the pattern of the augmented factor loadings plotted 
on the surface of a sphere, axis I being considered as perpendicular to 
the surface of the paper. From this figure it can be seen that simple 
structure is not obtained because of the difficulty in confining the
loadings within a right spherical triangle. Distinct and clear-cut clusters
are absent; only minor ones appear. The lack of simple structure
makes it difficult to measure the relations among the three factors. It 
is apparent, however, that the factors are not orthogonal. Therefore, 
Factors I, II, and III are apparently interrelated. 


DISCUSSION OF RESULTS 


The positive correlations among the abilities tested, the tendency 
toward hierarchical formation, the overlapping among the correlations 
and the factor loadings, and the correlations among the factors which 
are present in these abilities, tend to demonstrate that abilities instead 
of existing independently are in dynamic relationship with one another. 
Positive correlations are reported in almost all biometrical and psycho- 
logical studies. However, no further consideration is given to this fact 
other than to submit as an explanation the rather vague hypothesis 
that natural selection favors positive correlations or that “desirable
qualities in mankind tend to be positively correlated.” Even if
Thomson were correct in this Lamarckian hypothesis, it ought not to
preclude the concept of functional relations of the abilities. Yet his
sampling theory seems to exist in total disregard of this fact. The
consistent appearance of positive correlations between abilities, even
though the correlations are often extremely low, indicates that
the explanation must extend beyond the simple realm of chance
relationship.

In a study involving clerical number and name checking, Andrew? 
reported all correlations as positive among the tests of various abilities. 
This indicated “the existence of a general factor, while relatively high
correlations indicate the presence of group factors, or in other words
overlapping.” She stated also that multiple factor analysis
“indicates that a common factor runs through all the tests and in addition
there are minor group factors.” Despite these statements
Andrew’s general conclusion was that the Minnesota Clerical Test “is
measuring a specific ability which is relatively independent of spatial,
academic, and dexterity abilities.” The existence of hierarchies among
the abilities was given little consideration except by Spearman and his 
colleagues. 

The correlations among the factors found in the present investiga- 
tion seem to be confirmed in other studies. Garrett‘ found in multiple 
factor analyses of the results involving different abilities that positive 
correlations existed between the group factors which had been isolated. 
Thus, in one analysis, he found a correlation of .225 between the 
numerical and verbal factors. In analyzing a second group of data he 
reported respective correlations of .825 between the verbal and numeri- 
cal factors, .273 between the verbal and non-language (performance) 
factors, and .296 between the numerical and non-language factors. He 
found in analyzing the Anastasi studies, however, that the memory 
factor was independent of both the verbal and numerical factors. The 
angles were orthogonal, the correlations being .00 and —.085 between 
the memory factor and the numerical and verbal factors, respectively. 

Murphy’? found that the primary traits, in the investigation of the 
relation between mechanical ability and intelligence, were “oblique 
rather than orthogonal,” while Morris,’ in contrast, reported that the
“mental traits ... are orthogonal, or in other words, they are 
independent mental traits.” 

The results obtained here indicate the presence of specific and group
factors which are coördinated through the presence of a common factor.
In this respect, these findings seem to be analogous to those of Spearman.
The specific factors are revealed in the rather low communalities
while the group factors are shown in the isolated factors. Factor I,
the configurational or coördinating factor, integrates the abilities into
ordered activity.

It is not at all clear whether the g-factor of Spearman is similar to 
this configurational factor, since Spearman has never committed him- 
self specifically on this point. It does appear, however, that the g-fac- 
tor is the sum of all the factors instead of one which integrates the 
mental functions. 

On the basis of this, therefore, the specific factors are separate and 
independent. They are related only through the additive process, the 
g-factor, which is present to some degree in all the specific and group 
factors but does not coördinate them. Thus, the Spearman two-factor
theory represents a static system and is apparently incomplete for 
explaining the results obtained here. More adequate agreement with 
the results, however, is found if one accepts Alexander’s modification 
of the theory whereby he emphasizes the inseparable quality of the 
group factors as well as the interrelationship among these factors. 

The “primary abilities” of Thurstone and the “unique traits” of
the University of Minnesota group are representative of a mechanical 
and atomistic explanation. The independent existence of abilities 
according to these explanations is indicative of a behavioral anarchism 
whereby each ability is isolated and specific. That abilities are by no 
means absolutely specific and diverse is apparent in the existence of 
considerable overlapping of function. Thurstone contradicts the 
concept of independent functions by reporting oblique rather than 
orthogonal relations among the factors in several studies.” 

If what is commonly called mental activity is regarded as the 
ordered and integrated expressions of the total personality, then these 
expressions as a consequence must be considered as in functional 
relationship instead of static and isolated phenomena. Modern 
psychological knowledge indicates that the behavior of the organism is 
not a congeries of disparate faculties but rather an organismic unity in 
which there is a dynamic relationship between functional and structural 
aspects. The performance of an act by an individual involves the total 
personality. Thus excellence in a single ability can not be attributed 
to a specific faculty of some kind. Such an explanation represents an 
evasion. An explanation more in keeping with most recent experi- 
mentation would attribute such behavior to circumstances which favor 
the appropriate combination of environmental and hereditary factors 
expressed through the total organization of the personality. 

Insight into the organization of abilities is offered by the factor
analysis technique. Practically all of the experimental analyses
contradict the conception of independence among human abilities.
Thurstone asserts, “The cosine of the angular separation of each pair of
primary trait vectors is the correlation between the corresponding
primary traits in the experimental population. It will probably be
found that these correlations are positive.” In The Vectors of Mind
he indicates several studies with positive correlations among the
“traits.” In a separate study of vocational interests, after having
factored out eight primary interests, Thurstone states, “These reference
factors are not all uncorrelated. Several of them have intercorrelations
of .25 or .30 in the experimental population but most of them
have zero correlations.” ®

Although the theory of primary mental abilities seems to have been 
founded upon the idea that abilities are independent (orthogonal), 
Thurstone has of late given definite indications of deviating from this 
belief. He says, “Among statisticians and psychologists there is a
rather general belief that if human traits are to be accounted for by any
kind of factors, then these factors must be uncorrelated. The geometrical
representation of uncorrelated factors is a set of orthogonal
reference vectors. This belief has its origin in the statistical and
mathematical convenience of uncorrelated factors and also in our
ignorance of the nature of the underlying structure of mental traits.
Since we know so little about them and since it is statistically convenient
to use uncorrelated reference traits, the insistence on orthogonality
can be understood, but it cannot be justified.” He states at
various times that factor analysis “. . . assumes that a variety of
phenomena within the (mental) domain are related and that they are
determined, at least in part, by a relatively small number of functional
unities, or factors”; or that “. . . mind is not a patternless mosaic of
an infinite number of elements without functional groupings”; or “The
factors are probably functional groupings, and it is a distortion to
assume that they must be elemental”; and finally, “. . . the results
point to the conclusion that mind is not a structureless mass, but that
it is structured into constellations or groupings of processes that can be
identified as distinct functions in the test performances. These are
what I have called primary mental abilities or traits.”

That these primary abilities are independent at least in part still 
remains the thesis of Thurstone. In opposition to this, however, it is 
recognized that human abilities do not follow the so-called all-or-none
hypothesis. Instead, the presence of these abilities is shown to exist 
in all cases in varying degrees, depending upon different hereditary and 
environmental factors. These abilities are dynamic expressions of the 
total personality; hence they exist in functional relationship to each 
other. This appears to have been shown in the present investigation. 


SUMMARY AND CONCLUSIONS 


Eighty relatively homogeneous male college students were given in 
a random manner the Thurstone American Council on Education 
Psychological Examination for College Freshmen, the Seashore 
Measures of Musical Talent, the Meier-Seashore Art Judgment Test, 
the Minnesota Vocational Test for Clerical Workers, the Likert-Quasha 
revision of the Minnesota Paper Form Board, the short form of the 
Minnesota Spatial Relations Test, the short form of the Minnesota 
Assembly Test, and the O’Connor Finger and Tweezer Dexterity Tests.
In summary the salient findings of the present experiment seem to be 
the following: 

(1) The intercorrelations among the variables are on the whole 
positive but low. There is considerable overlapping throughout the 
intercorrelations. The highest intercorrelations appear among the 
tests which bear the name of the same ability (i.e., the intelligence
tests). Hierarchies among certain of the abilities are apparent. 

(2) Four factors are found by means of the Thurstone “center of 
gravity” technique, of which three seem to be important. Factor I 
seems to be a general, integrating factor. Factor II seems to be 
made up of two “subfactors,” one combining the group factors of the
mechanical and manipulative abilities and the other combining the 
intelligence and clerical group factors. Factor III indicates relation- 
ships between musical ability and clerical ability as well as between 
intelligence and mechanical ability. Consistent relationship between 
art judgment ability and mechanical ability appears in the factors. 
There is considerable overlapping among the factors. They seem to be 
interrelated instead of completely independent of each other. 

By virtue of these findings, it would appear that the Spearman and 
Thurstone theories are inadequate for explaining the relationships 
expressed in this study. Rather must one conclude with the hypothe- 
sis that the abilities here tested are not disparate and static abilities, 
but that they are, instead, functional and dynamic relationships within
the total personality. This organismic conception seems to be in
closer conformity to modern psychological theory than the previously
reported atomistic hypotheses.


IRREGULARITIES OF UNIVERSITY STUDENTS ON
THE REVISED STANFORD-BINET* 


MILDRED B. MITCHELL 


Psychologist, Bureau of Psychological Services, Minnesota 


INTRODUCTION 


While making a study of the differences between the 1916 and 1937 
editions of the Stanford-Binet,’ the writer noted many irregularities 
of performance on the 1937 edition, not only among the Psychopathic 
Hospital patients, but also among the university students. This 
paper will be limited to a study of the irregularities found among the 
students, and a later article will give an analysis of irregularities for 
psychopathic subjects. Sixty-seven university freshmen and eighty- 
six senior medical students at the State University of Iowa were used.†

There seemed to be some inadequacies in the new test, as well as 
in the old, at the higher levels. For instance, only one senior medical 
student and only eighteen freshmen failed all tests at the Superior 
Adult III level. This, along with the selection of students, tended 
to decrease the range for the medical students and probably explains 
the correlation of only .15 found between the medical students’ grades 
for four years and their IQ on the Revised Stanford-Binet, in contrast 
to the .64 found for university freshmen between their grades and 
IQ.6 When the sigma is increased statistically to that of the freshmen 


¢Vl-r 
by the formula Tink 
becomes practically as high as for the freshmen; namely, .60. 

Shakow and Harris* have summarized the work on scatter, using 
the 1916 edition of the Stanford-Binet and give an excellent bibliog- 
raphy. Harriman? reported irregularities on the Revised Stanford- 
Binet for fifth- and sixth-grade children averaging eleven years, seven 
months in chronological age. He found, for instance, that the x11 
year level was easier for his subjects than either the x11 or xIV year 
levels. Carlton! also found year x1 easier than year xu for two 


hundred fifteen mental defectives having a mean chronological age 





the r with grades for the medical students 





* This paper was read at the American Psychological Association meeting, 
Pennsylvania State College, 1940. 
† The writer wishes to thank Miss Mildred Heald, Dr. Dewey Stuit, and Dean
Lonzo Jones for their cooperation in making the freshmen available.
of fourteen years, two months. As will be seen later, this is in marked
contrast to our results for adult subjects. This is not surprising, of 
course, since Pressey and Cole’ pointed out over twenty years ago that 
the relative difficulty of various tests (using Yerkes Point Scale) was 
different for adults than it was for children. 

One of the outstanding differences between the 1916 and 1937 
editions of the Stanford-Binet is in the frequency of multiple bases on 
the later edition. No doubt, this is due in part to the larger amount 
of material with smaller differences in difficulty between adjacent 
levels. For example, the differences between years XII and XIV would
be expected to be greater than those between XII and XIII. (There
was, of course, no XIII year level on the 1916 edition.) Multiple
bases were so infrequent on the 1916 edition that Shipley® in making 
studies of scatter on that edition omitted such cases. 


METHOD 


Since we did not originally plan to study scatter, we may not always 
have obtained the maximum results. We started the testing by giving 
the Vocabulary, Digits Forward, the Plan of Search, and Digits Backward.*
It was found quite accidentally that students sometimes failed
tests other than the Plan of Search at the XIII year level even though
they passed all tests at the XIV year level. Since the writer was giving
the tests partially as a teaching device in connection with a course in
“Psychometrics in Psychiatry” for senior medical students, the nonverbal
tests at year XIII were given below the basal year, just to illustrate
the variety of material. When it was discovered that this
material was sometimes failed, it was always given. Therefore, in
most cases all tests were given from years XII or XIII through Superior
Adult III. In seventy-eight per cent of the cases, the testing went at
least as low as year XIII.


RESULTS 


A level is considered the lowest base if no tests are failed below it, 
or at it, but at least one test is failed in the level above it. All levels 
above that on which there are no failures are also counted as bases, 
even though there may be two in succession. For instance, cases in 





* The longitudinal method recommended by Wells* for psychotic patients 
is more satisfactory for normal adults also than the vertical method Terman and 
Merrill’? use when working with children. 
which the Plan of Search was failed at year XIII, but nothing else was
failed below Superior Adult I are considered as having the lowest base
at year XII and two other bases; namely, years XIV and Average
Adult. Figuring in this way, the mean number of bases for the freshmen
was 1.5 and for the senior medical students 2.0. The range was
from one to five for both groups. Thirty, or forty-five per cent, of the
freshmen and forty-eight, or fifty-six per cent, of the medical students
have more than one base.
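The counting rule just described can be sketched in a few lines. This is only an illustration of the definition as worded above, not the writer's scoring procedure. The function name and the failure counts above Average Adult are hypothetical, and the treatment of a record with no failure-free level at the bottom is an assumption, since the text does not cover that case.

```python
def find_bases(failures_by_level):
    """Return the positions of the levels counted as bases.

    failures_by_level: number of tests failed at each level given,
    ordered from the lowest level tested to the highest.
    """
    # Lowest base: the last level of the unbroken run of failure-free
    # levels at the bottom of the record.
    lowest = None
    for i, fails in enumerate(failures_by_level):
        if fails > 0:
            break
        lowest = i
    if lowest is None:
        # Assumption: failures at the lowest level tested mean that no
        # base was established within the range given.
        return []
    bases = [lowest]
    # Every failure-free level above the lowest base is a further base,
    # even when two such levels come in succession.
    for i in range(lowest + 1, len(failures_by_level)):
        if failures_by_level[i] == 0:
            bases.append(i)
    return bases

levels = ["XII", "XIII", "XIV", "Average Adult",
          "Superior Adult I", "Superior Adult II", "Superior Adult III"]
# Plan of Search failed at XIII, nothing else failed below Superior Adult I.
failures = [0, 1, 0, 0, 2, 4, 6]
print([levels[i] for i in find_bases(failures)])
```

For the case cited in the text the sketch returns year XII as the lowest base, with years XIV and Average Adult as the two further bases.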

The following tabulation shows the number of bases for freshmen
and senior medical students:

Number of bases                        Freshmen   Medical students
1                                         37            38
2                                         20            22
3                                          8            15
4                                          1             9
5                                          1             2
Mean                                      1.5           2.0
Range                                   1 to 5        1 to 5
Per cent with more than one base          45            56








Table I shows the location of the bases and the per cent of times
each base is lowest. For example, year XIV is a base a total of eighty-five
times, but only in twenty-four per cent of these times is it the
lowest base. In other words, if a base is obtained at fourteen years,
there are more than three chances in four that there is a lower base, i.e.,
that one or more tests will be failed below that level for this population.
When it is a base, it is the lowest base thirty-two per cent of the time
for the freshmen and only seventeen per cent of the time for the senior
medical students. Similarly, when Average Adult is a base, it is the
lowest base only twenty-eight per cent of the time. In contrast, when
year XII is a base, it is the lowest base ninety-eight per cent of the time
and year XIII, fifty-seven per cent of the time. (Perhaps year XII would
not have been the lowest base so often if we had always given the tests
at year XI.)

Multiple bases were found, as noted above, for approximately 
half of our subjects. Multiple tops are also often found on the 
Revised Stanford-Binet (Carlton¹ found more than one top for 18.6
per cent of his two hundred fifteen mental defectives), but with a
college population, they are relatively infrequent because the test is
not sufficiently difficult to get even one level of failures on many 
subjects. Only one of our senior medical students failed all tests at 
Superior Adult III. Similarly, no top was found for forty-six, or 
sixty-eight per cent, of the freshmen. However, four freshmen passed 
some tests after a level of failures. 


TABLE I.—NUMBER OF TIMES EACH LEVEL IS A BASE AND PER CENT OF THESE
TIMES IT IS THE LOWEST BASE, FOR FRESHMEN, SENIOR MEDICAL STUDENTS,
AND TOTAL

                                            Level
                    Superior Adult   Average
                    III   II    I     Adult    14   13   12   11   10    9    8    7    6

67 Freshmen
  Number             ..   ..    8      11      38    7   21   11    1    8    3    1    1
  Per cent           ..   ..   50      45      32   71   95   82  100   88   67  100  100

86 Senior Medical Students
  Number              3    3   28      32      47    7   41    7    1    2    2   ..   ..
  Per cent            0    0   54      22      17   43  100  100    0  100  100   ..   ..

Total 153
  Number              3    3   36      43      85   14   62   18    2   10    5    1    1
  Per cent            0    0   52      28      24   57   98   89   50   90   80  100  100



































Since many clinical psychologists and psychiatrists lay a great 
deal of stress on range (Age-Level Scatter), the problem of multiple 
bases is an important one. Should the range include all the bases or 
just the highest one? Should it include all the tops or just the lowest 
one? In other words, is there a reliable difference when it is figured 
from the lowest base through the highest top and when figured from 
the highest base through the lowest top? The results for forty-eight 
medical students with more than one base and for thirty-two freshmen 
with more than one base or top are given in the following tabulation: 
                            From lowest base      From highest base     Ratio of difference
                            to highest top        to lowest top         to the probable
Students                    Mean         σ        Mean         σ        error of difference
Freshmen                    [illegible]  1.64     4.2    [illegible]          3.4
Senior medical students     7.2          1.34     3.8    [illegible]          3.9











The range for both groups is nearly twice as long when all passes and 
failures are considered as when the usual method of giving a Binet test 
is used, i.e., from one level of passes to one level of failures. The
difference is statistically reliable, the ratio being 3.4 for the freshmen 
and 3.9 for the medical students. 

The next point to be considered is whether there is a significant 
difference in IQ when the two methods are used. The tabulation 
below gives the results: 




















                                From lowest base      From highest base     Ratio of difference
                                to highest top        to lowest top         to the probable
Students                        Mean       σ          Mean       σ          error of difference
Freshmen (N = 32)               114.53    12.48       116.59    12.03          .87
Senior medical students
  (N = 48)                      130.83     9.36       133.23     9.48         1.89














While the mean difference in IQ is only about two or three points, and 
the difference can hardly be considered statistically significant, it 
should be pointed out that differences in individual scores ranged from 
−10 to +13 IQ points. Only four freshmen’s and no senior medical
students’ scores were raised by using the longer range. This is due 
to the few tops found, as mentioned above. The mode for each group 
was a decrease of one point when the long range was used. 
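A ratio of this kind may be sketched as follows. The exact formula used for the tabulated ratios is not stated, so the computation below is an assumption: it takes the conventional probable error (0.6745 times the standard error) of a difference between two means treated as uncorrelated. The function name is hypothetical. With the means and sigmas printed above it gives values of the same order as the printed .87 and 1.89, but it does not reproduce them exactly.

```python
import math

def ratio_to_probable_error(mean_a, sd_a, mean_b, sd_b, n):
    """Difference between two means divided by its probable error.

    Assumes equal n for both means and ignores any correlation between
    the two sets of scores (an assumption; the source gives no formula).
    """
    se_diff = math.sqrt(sd_a ** 2 / n + sd_b ** 2 / n)
    pe_diff = 0.6745 * se_diff  # probable error = 0.6745 x standard error
    return abs(mean_a - mean_b) / pe_diff

# IQ means and sigmas as printed in the tabulation above.
freshmen = ratio_to_probable_error(114.53, 12.48, 116.59, 12.03, n=32)
medical = ratio_to_probable_error(130.83, 9.36, 133.23, 9.48, n=48)
print(round(freshmen, 2), round(medical, 2))
```

On the older convention a ratio of about four was usually required before a difference was called reliable, which agrees with the conclusion that these IQ differences are not significant.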

The multiple bases suggest that for this group of subjects, the levels 
are not in order of difficulty, or, at least, some of the individual tests 
are not placed with other tests of the same difficulty. First, let us 
consider the relative difficulty of the levels as a whole. The following 
tabulation shows the number of errors made by one hundred fifty-three
university students at each of the levels from XII to Superior Adult III
on the Revised Stanford-Binet Form L:

518 The Journal of Educational Psychology

LEVEL                      ERRORS
XII                           42
XIII                         118
XIV                           47
Average Adult                146*
Superior Adult I             272
Superior Adult II            471
Superior Adult III           559

* 6/8 of 195 to correct for extra number of tests at this level.

Except for levels XIII and XIV, there is a constant increase in number
of errors at each successive level. The differences between XII and XIV,
however, are very slight.
It is of particular interest to note that Harriman, working with younger
subjects, also found irregularities at these levels, but in the opposite
direction. His subjects passed more tests at year XIII than at either
years XII or XIV. He writes,² “The test items at year level XII seem to
be more difficult than those at XIII.”

Let us now consider the individual test items at these three levels;
namely, XII, XIII, and XIV. They are shown for one hundred fifty-three
university students in rank order of difficulty, with number of errors,
in the following tabulation:

Rank     Errors    Test number
 1         59      XIII-1
 2         31      XIV-2
 3         24      XIII-6
 4         15      XIII-3
 5         14      XII-4
 6         12      XII-6
 7.5       11      XIII-2; XII-3
 9          8      XIV-4
10          5      XIII-4
12          4      XIV-3; XIII-5; XII-2
14          3      XIV-5
15.5        1      XIV-1; XII-1
17.5        0      XIV-6; XII-5
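The tied ranks in this tabulation (7.5, 12, 15.5, 17.5) follow the usual convention of giving tied items the mean of the ranks they occupy. The sketch below recomputes them. It reads the level labels as XII, XIII, and XIV, on the assumption that the per-level totals must agree with the 42, 118, and 47 errors reported for those levels. The function name is hypothetical.

```python
from collections import defaultdict

# Errors per test, read from the tabulation above.
errors = {
    "XIII-1": 59, "XIV-2": 31, "XIII-6": 24, "XIII-3": 15, "XII-4": 14,
    "XII-6": 12, "XIII-2": 11, "XII-3": 11, "XIV-4": 8, "XIII-4": 5,
    "XIV-3": 4, "XIII-5": 4, "XII-2": 4, "XIV-5": 3, "XIV-1": 1,
    "XII-1": 1, "XIV-6": 0, "XII-5": 0,
}

def average_ranks(scores):
    """Rank from most to fewest errors; tied items share the mean rank."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of items tied with item i.
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        mean_rank = ((i + 1) + (j + 1)) / 2
        for k in range(i, j + 1):
            ranks[ordered[k]] = mean_rank
        i = j + 1
    return ranks

ranks = average_ranks(errors)

# Sum the errors by level as a check on the level totals.
level_totals = defaultdict(int)
for test, count in errors.items():
    level_totals[test.split("-")[0]] += count

print(ranks["XIII-2"], ranks["XII-2"], dict(level_totals))
```

The level sums returned are 42 for XII, 118 for XIII, and 47 for XIV, the same as the totals given for those levels earlier.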





Test XIII-1, Plan of Search, which heads the list, is not only failed
nearly twice as often as any other test at these levels, but is failed more
frequently than any other test below the Superior Adult I level. Out
of the fifteen most difficult tests at all levels, it is the only one below
Superior Adult I; it ranks twelfth in number of errors. Previous
experience had also shown it to be failed more frequently by psychotic
patients than many tests placed higher in the scale (both on the 1916
and 1937 editions). The next most difficult test at these levels for our
subjects is the Induction, XIV-2. It is the only test, however, among
the first eight that is located at the XIV year level. The next two most
difficult tests are the other two nonverbal tests at the XIII year level;
namely, the Bead Chain and Paper Cutting. They are also failed
more frequently than several of the tests at the Average Adult level.
In fact, the Bead Chain ranks ahead of half the tests at Average Adult
in number of errors. It should be noted that none of the four most
difficult tests at these levels, the Plan of Search, Induction, Bead
Chain, and Paper Cutting, are purely verbal.

At the bottom of the list in the preceding tabulation we find the
purely verbal tests, Abstract Words, XII-5 and XIV-6, which were not
failed by any subjects. In fact, the Abstract Words tests were the
only tests above year X which were not failed by at least one of our
subjects. Harriman² found the Abstract Words easier for children
than some of the other tests at the same level. He says, “There is a
discrepancy between the ‘Messenger Boy’ item (XII-3) and the
‘Abstract Words’ (XII-5). The former item was passed by forty per
cent of the pupils, whereas seventy-five per cent correctly defined the
abstract words.” (Only eleven, or seven per cent, of our subjects
failed XII-3.)

On the whole, then, year XIII is more difficult for our group of subjects
than either years XII or XIV, and the Plan of Search, XIII-1, is a
particularly hard test. However, without it, the total number of
errors for year XIII is still more than for either of the other two levels.
Of the seventy students who were given all tests at years XII, XIII, and
XIV, and had no errors at year XIV, eighty-six per cent had errors at
year XIII, twenty-nine per cent had errors at year XII, and twenty-six
per cent had errors at both years XII and XIII. The nonverbal tests
are relatively more difficult for the adult subjects than for children.

Year levels XII, XIII, and XIV are not the only places where there
are discrepancies between the results Harriman obtained for children
and the results we obtained for adults, as will be seen if we look at the
individual items higher in the scale which he points out as showing
irregularities in the Revised Stanford-Binet. The most outstanding
discrepancy is found on the Vocabulary test. None of the children
passed the Vocabulary test at Superior Adult I, but all the senior
medical students and sixty-seven per cent of the freshmen passed it.
The following tabulation shows the per cent of successes on certain
tests:








Superior Adult   Fifth and sixth     University freshmen   Senior medical students
Test             grade (N = 200)     (N = 67)              (N = 86)
I-1                     0                   67                    100
I-2                    30                   60                     86
I-3                     5                   43                     57
II-6                   15                   31                     28
III-4                  15                   22                     36














The “Enclosed Boxes,” Superior Adult I-2, was the next easiest in this
list for our subjects and the easiest for the children. Harriman concluded,
“Probably, therefore, the ‘Enclosed Boxes’ item is placed too
high up in the scale, for it was solved by thirty per cent of the pupils.”²
For our subjects, it was not too easy as compared with the other tests 
at the same level. It ranked fourth in difficulty among the six tests at 
Superior Adult I and was passed by fewer students than the Vocabu- 
lary at the same level. 

Although our results for university students do not show the same 
relative difficulty for particular items as Harriman found for children, 
they show even greater irregularity between items. From the few 
items Harriman publishes in detail there seems to be little correlation 
between the rank difficulty of the items for children and adults. On 
the other hand, there is a rank correlation of .88 between the difficulty
of all items from XII-1 through Superior Adult III-6 for our university
freshmen and senior medical students. The most conspicuous difference
between our two groups was in Vocabulary. No senior medical
student failed the Vocabulary at Superior Adult I, but it ranked about
midway (rank 22.5) in difficulty for the freshmen. The Vocabulary
appears to be relatively easy for university students. As we have
previously pointed out,⁶ seventy-six per cent of the medical students
passed the Vocabulary test at Superior Adult III.
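The contrast drawn here can be illustrated with the five Superior Adult items for which per cents of success are tabulated above. The sketch below computes a rank correlation with tied values given their mean rank. It is an illustration on five items only: the .88 cited in the text covers all items from XII-1 through Superior Adult III-6 and cannot be recomputed from the figures printed here. The function names are hypothetical.

```python
import math

def average_ranks(values):
    """Ranks from lowest to highest; tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def rank_correlation(x, y):
    """Product-moment correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)

# Per cent passing items I-1, I-2, I-3, II-6, III-4 (from the tabulation).
children = [0, 30, 5, 15, 15]
freshmen = [67, 60, 43, 31, 22]
medical = [100, 86, 57, 28, 36]

adults = rank_correlation(freshmen, medical)
child_vs_freshmen = rank_correlation(children, freshmen)
print(round(adults, 2), round(child_vs_freshmen, 2))
```

On these five items the two adult groups agree closely (about .90), while the children's order of difficulty bears little relation to the freshmen's (about -.36), which is the pattern described in the text.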

Although the total number of errors was considerably greater for
Superior Adult III than for Superior Adult II, when the test items are
all ranked in order of difficulty, half of the first six come from Superior
Adult III and half from Superior Adult II. It is not surprising, therefore,
that we find multiple tops as well as multiple bases.























University Students and the Revised Stanford-Binet 521 


SUMMARY AND CONCLUSIONS 


The irregularities of performance of sixty-seven university fresh- 
men and eighty-six senior medical students on the Revised Stanford- 
Binet Form L have been studied. In the first place, about half of 
the students obtained more than one base. The freshmen averaged 
1.5 bases and the medical students averaged 2.0 bases. Years XIV,
XII, and Average Adult, respectively, were bases more frequently than
any other levels. Although year XIV was a base more frequently than
any other level, it proved to be the lowest base in only about one-fourth
of the cases. It is, therefore, impossible to assume that an
adult subject will pass all tests below year XIV merely because he passes
all tests at that level. Neither can it be assumed that a subject will
pass all tests at year XIII merely because he passes all tests at Average
Adult. The nonverbal tests at year XIII are all failed more frequently
by the university students than several of the verbal tests at Average
Adult.

The IQ’s and ranges were found from the lowest base through the
highest top, and from the highest base through the lowest top. The
range was nearly twice as long for the lowest base through the highest
top as for the highest base through the lowest top, and the difference
was statistically reliable for both groups of subjects. The mean
difference in IQ, however, was only about two points and was not
statistically reliable. It seems worth while, nevertheless, to make
complete examinations in order to obtain a better picture of the subject’s
abilities and weaknesses.

Year level XIII was more difficult for our subjects than either years
XII or XIV. This was in marked contrast to what Harriman² and
Carlton¹ found for normal and feeble-minded children respectively.
They found year XIII easier than either years XII or XIV. This is probably
due to the nonverbal material at year XIII which is easier for the
children than for the adults. Our results agreed with Harriman’s not
only in finding gross irregularities for these levels, but also for individual
test items at the higher levels. These were fairly consistent
between our two groups of subjects, but seemed to be radically different
from Harriman’s younger subjects. The Vocabulary test showed
the maximum discrepancy. It was the easiest at Superior Adult I for
the university students and the hardest for the fifth- and sixth-grade
pupils. On the other hand, both adults and children found Abstract
Words relatively easy.
In testing young adults of at least average intelligence with Form L
of the Revised Stanford-Binet, more than one base may be expected.
Year XIV is particularly unreliable as a true base. Unless all the nonverbal
items at year XIII are passed in addition to all tests at year
XIV, it is unwise to assume that the true basal has been reached. In
most cases it will be necessary to go at least as low as year XII or XIII
to obtain a reliable base.
THE EFFECT OF PRACTICE WITH KNOWLEDGE OF
RESULTS UPON PITCH DISCRIMINATION


EARLE CONNETTE 
Assistant Professor of Music, North Texas State Teachers College 


Beginning with the early works of Carl Emil Seashore and his 
students at the State University of Iowa,’ it has been generally held 
that the ability to discriminate pitch is a stable constant of an indi- 
vidual’s ‘‘inherited structure”? which can be determined rather easily 
with proper measuring technique. One of the questions which has 
grown out of the attempts to measure musical talent is: Can pitch 
discrimination be improved with practice? The consensus of experi- 
menters and others who have written empirically seems to be that pitch 
discrimination cannot be improved with practice. In the writer’s 
opinion, however, the evidence from the experimental studies reported 
up to January 1, 1941, is not conclusive enough to justify a dogmatic 
negative answer and the conclusions drawn by those writers who have 
used these data deductively can be disregarded entirely. James L.
Mursell® says: “All things considered we must regard the claim that
pitch discrimination is a function which depends directly upon inherited
structure and so cannot be influenced or improved by training as
an unproved assumption.” This statement is diametrically opposed to
Seashore’ who, in 1910, wrote: ‘‘When the proximate physiological 
threshold has been reached practice is of no avail. . . . In the majority 
of cases it is possible for the ingenious experimenter to discover the 
proximate physiological threshold to a fair degree of certainty in a 
well-planned half-hour individual test. . . . The mere detection of 
pitch difference . . . is a simple process, requiring only the slightest 
amount of training.’”’ The writer believes this view was derived, in 
part at least, from the work of Buffum described in the same study from 
which the above quotation was taken. Buffum gave twenty practice 
periods to a group of twenty-five eighth-grade pupils. At the end of 
the series he concluded there had been no practice effect in any group 
whose performances were good, medium, or poor at the outset, and 
practically no change in the relative positions of the subjects. 

Franklyn O. Smith* conducted a similar study with four hundred 
seventy-six subjects, using twelve training periods. Whereas slightly 
over a majority of the subjects showed some evidence of improvement, 
he concurs essentially in Seashore’s view as quoted above. In making 
this interpretation, he relies upon a distinction Seashore made between
a “cognitive” and a “physiological” threshold. The cognitive threshold
is thought to be higher than the physiological because of such
factors as “lack of information, best form of attention, interest, and 
effort.” In another group of subjects, Smith found that preliminary 
instruction, including illustration of pitch difference, improved the 
threshold. Hence the improvement found during the practice series 
is regarded as a “‘cognitive’’ development. Those who failed to 
improve are regarded as exhibiting a physiological threshold in the 
first trial. 

In an early study, Guy M. Whipple” investigated the possibility of 
improving pitch threshold with practice, giving one subject ‘‘several 
experimental hours” of coaching with negative results. He was of the 
opinion, however, that the practice series was too short to demonstrate 
the impossibility of improvement. 

Cameron! used six subjects in an experiment which tested the effect 
of practice in singing certain tones upon accuracy of discrimination and 
reproduction. Four of the subjects showed marked improvement in 
both functions when the test consisted of the tones specifically 
practiced. 

Hazel M. Stanton’ reported results upon repetitions of the Seashore 
pitch discrimination test in cases where musical training intervened 
between trials. An improvement in percentage of correct judgments 
was found in all such situations, the improvement being greater in ratio 
to longer periods of training. She is inclined to regard the change as 
one in cognitive conditions consequent upon general maturation rather 
than on the musical training. 

In a study of the reliability of the Seashore Measures of Musical 
Talent, McCarthy‘ gave four presentations of the pitch discrimination 
test to a group of school children. A comparison of the first results 
with the fourth showed no improvement. In this respect the pitch 
discrimination test differs from the others of the Seashore battery used 
by McCarthy. 

Cognate to the present problems are a number of studies on absolute 
pitch recognition. Gough? found that practice produces a marked 
improvement in judgment, regardless of initial ability. The longer the 
practice continued the greater the reduction of error for all subjects. 
Mull’ concluded from a study of training for absolute pitch discrimina- 
tion that such ability could be developed in the average subject. 

In a study of the effect of training upon the upper limit of hearing, 
Humes’ found rather surprisingly that the upper absolute limen 
actually decreased with practice! 
The foregoing studies have been summarized here because of their
bearing on the present study.
In the present study, two questions are foremost: (1) What predictive 
value lies in the usual measure of pitch threshold? and (2) is there a 
valid distinction between the physiological and cognitive thresholds? 
Contributions to the answers to these questions were derived from a 
series of experiments in which conditions highly favorable to practice 
effect were obtained. In the studies made previously, dealing directly 
with the effect of practice upon pitch discrimination, it is implied that 
the subjects were not informed regarding the correctness of their 
judgments. According to usual findings in learning experiments a
relatively small amount of improvement could be expected under these
circumstances. As an essential feature of the present experiment,
therefore, the subjects were informed of the relation of the tones of the 
test immediately after each trial. In this one respect the present study 
is entirely different from any other reported up to January 1, 1941. 


APPARATUS AND PROCEDURE 


In the experiment, tones for discrimination were produced by 
Stoelting forks, mounted dually in wooden resonators. The standard 
fork was 440 d.v. Seven comparison forks produced tones higher
than the standard by .5, 1, 2, 5, 17, and 30 d.v., respectively. To
learn the best procedure, to standardize instructions, and to
perfect technique, the entire experiment was first rehearsed with a
preliminary group of subjects not included in the main experimental
group, upon whose records the conclusions of the present study are
based. This procedure eliminated many errors that might otherwise
have entered into the study.

The subjects were all tested individually, seated out of sight of the
forks about a yard distant. Before presenting each pair of tones, the
experimenter gave a ready signal, then struck one fork and damped it,
and struck the other fork and damped it. Durations actually varied because
of the human element in the operation of these items of the test, but
special effort was made to sound each tone as nearly five seconds as
possible. Care was taken to strike the forks each time in the same
manner. During a single sitting, each subject was presented with each
pair of stimuli fifteen times, seven or eight pairs with the standard fork
struck first and seven or eight with the comparison fork struck first.
Pairs were presented in random order, and judgments were recorded by the
experimenter’s assistant as H or L as the subject reported the second tone
to be higher or lower than the first tone. As soon as the subject had
given his decision, he was informed as to its correctness. A sitting
required approximately one-half hour. On five succeeding days
the identical procedure was repeated with each subject. Distracting 
conditions and interruptions which entered into the rehearsal of the 
experiment with the preliminary group were removed in the main 
experiment. 

The subjects in the preliminary group were nine college students, 
none of whom was pursuing the music curriculum. In the main group,
twenty-three students who were not pursuing the music curriculum 
were selected from the student body, eighteen of whom were women 
and five of whom were men. Of these twenty-three, nine reported 
some degree of musical training other than that obtained in the public- 
school music classes before they entered college. 


RESULTS 


Data from the preliminary experiment are given in Table I as 
simple percentage of correct judgments by the nine subjects on each 
day. 

TABLE I.—PERCENTAGE OF CORRECT JUDGMENTS IN PRELIMINARY GROUP IN
SUCCEEDING SITTINGS

Sittings             First day   Second day   Third day   Fourth day   Fifth day
Per cent correct       74.4        79.7         84.3        87.3         79.9





Such a means of representation is inadequate for exact work, but 
does indicate the presence of some improvement in the group during 
the series. 

In the main experiment, two methods of calculation were used. 
First, taking the group as a whole, the percentage of correct judgments 
for each pair for each day was computed. The Spearman formula was 
then applied to the resultant figures, both when they were grouped 
according to order of presentation and when they were thrown together. 
Since the writer was here concerned with only the upper half of the 
complete distribution of the thresholds, it was necessary to make the 
assumption that equal numbers of H and L judgments would have been 
given for equal forks. Though such an assumption is doubtless in 
error to a slight degree, it is hardly materially so, since the smallest 
interval, the only one it affects, is given so little weight in the Spearman 
process. A further implication of the process of using half a distribution
in this manner is that the resultant “threshold” is really an
average deviation of the measures concerned.
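A rough sketch of a computation of this kind is given below. It is an interpretation, not the writer's worksheet: the calculation is not described step by step in the text, so the sketch assumes the usual summation form of the Spearman process, with each interval's midpoint weighted by the change in the cumulative proportion, and with the proportion of correct judgments p folded to a cumulative fraction 2p − 1 that is zero at zero difference (the assumption of equal H and L judgments for equal forks). The function name and the proportions are hypothetical.

```python
def spearman_threshold(differences, proportion_correct):
    """Half-distribution threshold by summing interval midpoints.

    differences        -- stimulus differences in d.v., ascending
    proportion_correct -- proportion of correct judgments at each difference
    Assumes the proportions rise without reversals and reach 1.0 at the
    largest difference, so that the whole half-distribution is covered.
    """
    xs = [0.0] + list(differences)
    # Fold: p = .5 (chance) maps to 0, p = 1.0 maps to 1.
    fs = [0.0] + [2.0 * p - 1.0 for p in proportion_correct]
    threshold = 0.0
    for i in range(1, len(xs)):
        midpoint = (xs[i - 1] + xs[i]) / 2.0
        threshold += (fs[i] - fs[i - 1]) * midpoint
    return threshold

differences = [0.5, 1, 2, 5, 17, 30]        # d.v., as listed in the text
hypothetical_p = [0.6, 0.7, 0.8, 0.9, 1.0, 1.0]  # illustrative values only
print(round(spearman_threshold(differences, hypothetical_p), 2))
```

In this form the smallest interval contributes only its midpoint of .25 d.v. times a small change in proportion, which is consistent with the remark that it is given little weight.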

In a second method of calculation, the Spearman process was 
applied to each subject individually for each day. This second 
method has the advantage of making possible computation of the 
standard deviation and correlations of the thresholds of the group. 
For comparison of thresholds from day to day, however, it is probably 
not as satisfactory as the first method, because of the considerable 
number of first-order reversals in the individual data. When all 
individuals are treated as a group, first-order reversals disappear. 

In Table II are shown the thresholds (A.D.’s) and constant (time) 
errors computed by the group method, the average thresholds and 
standard deviations computed from individual data. 


TABLE II.—THRESHOLDS OF PITCH DISCRIMINATION IN CE’s, SD’s, AND CORRELATIONS
ON SUCCESSIVE DAYS OF TRAINING

        Group                      Average
        threshold                  threshold                 Correlation
Day     (d.v.)         CE*         (d.v.)        S.D.        with 1st day
1       [illegible]    [illegible]  7.76          6.28           ..
2       6.51           +.365        6.55          5.14          .81
3       [illegible]    +.45         [illegible]   3.96          .74
4       5.20           +.15         6.39          4.84          .69
5       [illegible]    [illegible]  [illegible]   [illegible]   .79


| | 





* Plus sign for Constant Error indicates overestimation of second 8. 


Figure 1 shows the change in threshold over the period of five days 
as represented by the two methods of computation. There is no 
doubt but that improvement in threshold occurred, being of the order 
of fifty per cent. Possibility of decrease in the constant error is
discernible. The standard deviation decreases through the series,
keeping pace with the average so that the coefficient of variation shows
no consistent trend.


It would be possible for these figures to represent improvement in 
thresholds only in those who made a poor record in the beginning, but 
further examination shows otherwise. Average errors decrease and
average thresholds grow smaller with practice when knowledge of
results is given. Such changes in group averages do not depend on
marked changes in a few cases. For the twenty-three subjects, final
thresholds were better than the initial in twenty cases, equal in two
cases, and poorer in only one case. These latter three all involved
thresholds of less than 1 d.v. Since their thresholds, therefore, depend
on very few judgments (on the small intervals), it is doubtful whether
they should be regarded as definite evidence of failure to improve. As
a further check on the relation of improvement to initial score, the
better half of the subjects was compared with the poorer half. Average
thresholds for these two groups are shown in Table III. Substantial
improvement is seen in both groups, though the poorer group
made about twice as much gain in terms of percentage.

TABLE III.—COMPARISON OF THRESHOLD IMPROVEMENT IN THOSE INITIALLY
POORER AND THOSE INITIALLY BETTER

Group                  Initial       Final        Per cent
                      threshold    threshold    improvement
Initially better         2.28         1.62           29
Initially poorer        12.69         5.44           58
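
The percentages in Table III can be checked against the group means. The following is a minimal sketch, assuming that per cent improvement is the drop in threshold relative to the initial threshold; the function name is illustrative and does not come from the original report.

```python
def percent_improvement(initial, final):
    """Drop in threshold expressed as a percentage of the initial threshold."""
    return 100.0 * (initial - final) / initial

# Group means from Table III (thresholds in d.v.)
groups = {
    "initially better": (2.28, 1.62),
    "initially poorer": (12.69, 5.44),
}

for name, (initial, final) in groups.items():
    print(f"{name}: {percent_improvement(initial, final):.1f} per cent")
```

Computed from the tabled means in this way, the two groups come out near 29 and 57 per cent. The tabled value of 58 may reflect unrounded data or an average of individual percentages; the report does not say which.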














Examination of individual learning curves shows many ups and 
downs, but in no case was there anything which could be regarded with 
certainty as a sudden drop to a new level. On the second day, twelve 
made better records than on the first; on the third day fifteen made 
better records than on the second; on the fourth day, twelve made better 
records than on the third; and on the fifth day, twelve made better 
records than on the fourth day. Hence there is no evidence for cessa- 
tion of improvement in any large number of cases before the end of 
the series. 

The two cases showing the greatest absolute improvement moved 
from initial thresholds of 21.77 and 19.01 to 5.54 and 3.46, respectively. 
Disregarding for the moment the three cases where no improvement 
occurred, the two cases showing least improvement moved from 2.15 
and 2.66 to 2.05 and 2.47, respectively. On the last day the highest 
threshold was 9.98 and the lowest .39. 

The reliability of predictions of relative standing from the first 
measure to later measures is indicated by the correlations of later 
measures with the first. These correlations are somewhat higher than 
the strict reliability coefficients found by Stanton,⁹ and resemble more
those reported by McCarthy.⁴ They are, of course, of very limited
practical significance, as they apply to a condition in which all subjects 
were given equal training. If special training were given to the 
poorer subjects, predictive probabilities would not be represented by 
these figures. But in the writer’s opinion these correlations do not 
indicate reliability, for a practice effect is present, and in an r of 1.00,
the indication of perfect reliability, practice effect does not exist at all. 


DISCUSSION AND CONCLUSIONS 


To classify individuals permanently on the basis of their pitch 
discrimination thresholds in a single half-hour test is evidently impossi- 
ble. Under favorable conditions, training effect is large and the 
individual in the poorest class originally may gradually move up to the
best class in a week’s time. This result is perhaps what would be
expected from the work of Smith⁸ who found a certain amount of
improvement with practice without knowledge of results, and Stanton⁹
who found some degree of improvement either with age or as a result 
of musical training. 

The interpretation of these writers, however, is more complex, as 
indicated by Smith’s statement that “there is no evidence that the
physiological threshold improves with practice.” Following Seashore,⁷
a distinction is made between the cognitive and physiological thresh- 
old, a distinction which seems to require examination. The chief 
considerations for and against the two types of thresholds may be 
summarized as follows: 

(1) Some subjects when given instruction show improvement in 
threshold. The original thresholds are designated as cognitive. 

(2) Some subjects (in Smith’s⁸ experiment, for example) show
improvement with practice in pitch threshold. The original thresholds 
are designated cognitive, and practice is regarded as a poor form of 
instruction. 

It would be as logical to reverse this interpretation, and maintain 
that instruction, since it contained illustration, was a form of practice. 
But why should not practice and instruction be regarded as both effec-
tive, and a combination of the two as more so than either singly? The
greater effect from the combination seems to be shown in the
present experiment.

(3) Some subjects (in Smith’s⁸ and other experiments) do not show
improvement in threshold with practice or instruction. Such thresh-
olds are designated physiological. The physiological threshold is
said to be “set by the limits of capacity in the end organ” (Seashore⁷),
which apparently makes non-improvability the final criterion.

Since the “physiological threshold” is defined as a non-improvable
function, it is beside the point to look for the evidence of constancy or
improvement, for if improvement occurs in a measured threshold, as in
Stanton’s⁹ and Smith’s⁸ data, the threshold is immediately taken to be
cognitive. That the physiological threshold is a limit beyond which 
learning does not go is, of course, a function of a particular learning 
situation. A longer series or another sort of procedure might produce 
a new “limit.” Thus in the present experiment the use of knowledge
of results made it impossible to regard any of the original thresholds as 
“physiological.” 

(4) To distinguish between physiological and cognitive thresholds 
at a single sitting is left largely to the intuition of the experimenter. 
Low thresholds and small m.v. are taken as signs of physiological
threshold but “no single sign can be relied on” (Seashore⁷). The
experimenter must take into account the remarks of the subject, his 
manner, etc. 

Insofar as this method of distinguishing is liable to test, it failed 
with regard to subjects of the present experiment, since they improved 
regardless of threshold or m.v. 

On the whole, the evidence for recognizing two kinds of threshold 
seems insufficient to the writer. The physiological threshold as a 
general limit of attainment is too broad a generalization from a single 
kind of experimental procedure. Even within one experimental situa- 
tion there would seem to be no reason for assuming that learning 
approaches a limit in any kind of performance until that fact is demon- 
strated adequately. It is so seldom possible to say with certainty that 
a limit has been measured in a series of experiments that, to the writer 
at least, the concept seems of doubtful utility. If this is abandoned, 
there is no further use for the term “cognitive threshold” either, as its
main connotation is simply “non-physiological.”

The names of the two thresholds are likewise apt to be misleading. 
There seems to be no cogent reason, for example, why a non-improvable
threshold should be regarded as limited by the characteristics of the 
receptor organ rather than by the conditions of any other parts of the 
system involved. And why, lacking definite proof, should an improv-
able function be considered as “cognitively” limited, as though
“cognitive” factors were the only ones ever subject to modification?


SUMMARY 


(1) Average improvement in pitch discrimination in a group of
twenty-three subjects amounts to approximately fifty per cent in five 
days with the technique used. 

(2) All but three of the group showed improvement, the three being 
doubtful cases. 

(3) The half of the group initially poorer made a fifty-eight per 
cent improvement; that initially better, a twenty-nine per cent 
improvement. 

(4) Correlations of the first with later thresholds range from .69 
to .81. 

(5) Future ability in pitch discrimination cannot be predicted over 
even short periods of time if special training is given in the interim. 

(6) It is doubtful whether any distinction between a cognitive and
physiological threshold is valid. 

(7) If the term physiological threshold is retained, the results of 
this experiment indicate that such a threshold is not reached before the 
end of five sittings when knowledge of results is given. In simpler 
language, the pitch threshold improves during practice with knowledge 
of results. 


BIBLIOGRAPHY 


1. Cameron, E. H.: “The Effects of Practice in Discrimination of Singing Tones.”
Psychological Monographs, 1917, Vol. xxi, pp. 159-180.
2. Gough, E.: “The Effect of Practice on Judgment of Absolute Pitch.” Archives
of Psychology, 1922, Vol. xlvii, pp. 1-93.
3. Humes, A. F.: “The Effect of Practice on the Upper Limen for Tonal Dis-
crimination.” American Journal of Psychology, 1930, Vol. xlii, pp. 1-16.
4. McCarthy, D.: “A Study of the Seashore Measures of Musical Talent.”
Journal of Applied Psychology, 1930, Vol. xiv, pp. 437-447.
5. Mull, H. K.: “The Acquisition of Absolute Pitch.” American Journal of
Psychology, 1925, Vol. xxxv, pp. 369-493.
6. Mursell, James L.: The Psychology of Music. New York: Norton, 1937, p. 74.
7. Seashore, C. E.: “The Measurement of Pitch Discrimination: A Preliminary
Report.” Psychological Monographs, 1910-1911, Vol. x11, pp. 21-60.
8. Smith, F. O.: “The Effect of Training in Pitch Discrimination.” Psychological
Monographs, 1914, Vol. xvu, pp. 1-103.
9. Stanton, H. M.: “Measurement of Musical Talent.” University of Iowa
Studies: Studies in the Psychology of Music, Vol. ii, 1935, pp. 1-140.
10. Whipple, G. M.: “Studies in Pitch Discrimination.” American Journal of
Psychology, 1903, Vol. xiv, pp. 553-573.
















AN EXPERIMENTAL COMPARISON OF THE MULTIPLE 
TRUE-FALSE AND MULTIPLE MULTIPLE-CHOICE 
TESTS 


LEE J. CRONBACH 


State College of Washington 


In an earlier article,? the writer has commented upon the test 
exercise, similar to the multiple-choice test, in which more than one 
alternative is correct, and in which the student is required to select all 
correct choices. It was proposed that this form of test be considered a 
multiple true-false exercise, rather than a multiple multiple-choice 
exercise; suggestions for constructing the test based on a priori reason-
ing were made.

An essential proposal involved was that the student should indicate 
whether each alternative was true or false, instead of merely noting 
“true” alternatives as he would in taking a multiple-choice test. It is
important to determine on experimental as well as on logical grounds 
whether such a proposal is justified. For this purpose, such a test has 
been prepared and administered to two groups, one group following 
each form of instructions. 

During analysis of the data so obtained, a number of facts emerged 
relating to a rather different question which has implications for all 
true-false tests. Previous writers have noted that students frequently 
mark more items “true” than “false” in taking true-false tests. The
present discussion will advance the hypothesis that this tendency to be
“acquiescent” (marking “true” rather than “false” when in doubt)
rather than “critical” may be a personality trait, and that such a trait
may influence the validity of scores on true-false tests.


PROCEDURE 


The test used in this study covers chapters in Woodworth’s 
Psychology, Fourth Edition, dealing with Observation, Vision, and The 
Other Senses, and the accompanying lectures. The test was one of 
several examinations in a college course in general psychology. This 
material is appropriate to a test of the multiple-item form, as these 
chapters contain a few principles which were judged especially impor- 
tant. To test these ideas by one question each, as is customary in the 
usual objective test, would mean that each was tested only briefly and 
unreliably; further, a large proportion of the test would deal with the 
residue of more trivial facts. In the multiple-item forms, each item is 
devoted to one fact, generalization, or topic; but as each item contains
several sub-items, one has in effect asked several questions on the fact. 
Two items selected from the test, one covering the Weber law and the 
other sound localization, will demonstrate the form used: 


If one can just barely perceive the difference between a 9-inch line and a 
10-inch line, he will just be able to detect the difference between 
(a) A 10-inch line and an 11-inch line. 
(b) A 9-foot line and a 10-foot line. 
(c) An 18-inch line and a 20-inch line. 
(d) An 18-inch line and a 19-inch line. 
(e) A 30-inch line and a 32-inch line. 
One is likely to estimate the location of a sound inaccurately if 
(a) He hears through a pseudophone but sees the source of the sound. 
(b) He is in a room where echoes are completely absorbed by sound-
proofing.
(c) The sound is closer to one ear than the other. 
(d) He hears with only one ear. 
(e) He is blindfolded and the sound is made just as close to one ear as 
the other. 


In all, twenty-two items, each containing five sub-items, were used. 

The test items were mimeographed in the form shown, identically 
for all students. Two sets of directions were prepared on a separate 
sheet, one or the other set being attached to each examination paper. 
Set A (multiple multiple-choice) read as follows: 


This is a multiple-choice test, but is different from the usual type. Instead 
of selecting the best answer to each item, you are to mark every choice which 
correctly completes the statement. In most items, there will be more than 
one correct answer, but sometimes only one will be correct, or possibly none. 

Number your answer sheet for each page as usual, starting with 1. After 
the number of each question, write the letters of all the correct choices. If 
none of the five answers is correct mark the item 0. (A sample item followed.) 
. . . You will be penalized if you mark an incorrect answer as correct. If you 
are not certain about an item, but have some information, you are advised to 
guess. 


Set B (multiple true-false) was similar, with appropriate changes in the 
first paragraph, and the second paragraph changed to: 


Number your answer sheet for each page as usual, starting with 1. On 
the line after each number, place five letters, a, b, c, d, and e, with a space
between each. After letter a, place a + if phrase (a) completes the statement 
correctly; if it is not a correct completion, place a 0 in the blank; make the
same decision, marking + or 0, for each of the five phrases. (Sample item 
followed.) . . . You will be penalized if you mark a false statement true or a 
true statement false. If you are not certain, but have some information, you 
are advised to guess. 


Of the one hundred and ten sub-items, fifty-five were keyed “true.” 
Some exercises had five “true” choices, one had none; the number of
items having each number of “true” sub-items varied approximately in
accord with chance expectation with a mode at two “true” responses.
Contrary to previous suggestions, students were directed to guess if 
uncertain. If the merit of the multiple true-false test is to be studied 
thoroughly, it must be tried with both “guess” and “do not guess”
instructions. The use of the former in this study was determined by
the fact that results from it, with few omissions, should be simpler to
analyze. A further experiment, using “do not guess” directions, is
still to be performed.
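
The phrase “in accord with chance expectation” can be made concrete. The following is a minimal sketch, assuming that each of the five sub-items of an item is keyed “true” independently with probability one-half; this is an assumption for illustration, since the article does not state how the key was drawn up. The observed counts are those reported in Table I of this article.

```python
from math import comb

N_ITEMS = 22      # items in the test
N_SUB = 5         # sub-items per item

# Number of items having k sub-items keyed "true" (from Table I)
observed = {0: 1, 1: 2, 2: 10, 3: 4, 4: 4, 5: 1}

def expected_counts(n_items, n_sub, p_true=0.5):
    """Binomial expectation of how many items have k sub-items keyed true."""
    return {
        k: n_items * comb(n_sub, k) * p_true**k * (1 - p_true) ** (n_sub - k)
        for k in range(n_sub + 1)
    }

expected = expected_counts(N_ITEMS, N_SUB)
for k in range(N_SUB + 1):
    print(f"{k} true: observed {observed[k]:2d}, expected {expected[k]:.2f}")

# The key should contain 55 "true" sub-items in all
total_true = sum(k * n for k, n in observed.items())
print("sub-items keyed true:", total_true)
```

Under this assumption the expectation is symmetric, with equal modes at two and three “true” sub-items, whereas the actual key has its mode at two.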

The test was announced as usual and given to two class sections in 
successive periods. In each group, students in alternate seats took 
Form A, while the remainder took Form B. The two groups were thus 
made as equal as one would expect from chance division, but were not 
equated. Sixty students were in the A group, fifty-seven in the B 
group. The form of test was new to all the students tested. 


RESULTS 


Time Required for Each Form.—The time of starting the test was 
recorded, and the time of finishing tallied as each paper was turned in. 
Forty-seven minutes of the period remained after the signal to begin 
was given. Students were accustomed to working later than the clos- 
ing bell if necessary. As every paper was turned in within one minute 
of the closing bell, it appears that no student was pressed for time. 
Students were permitted to check over their papers before turning 
them in. 

As the data indicated no difference between the students and situa-
tion in the two periods, data have been combined in this and following
computations. The mean time used in Group A (those taking Test
A, multiple-choice) was 30.8 minutes; in Group B (true-false), 31.2
minutes. The difference is not statistically significant, and is too 
small to be of practical importance. The added time required to mark 
every sub-item, instead of only those believed true, is apparently not a 
drawback of the multiple true-false plan. 

While these data present no real comparison of multiple-item forms
with the usual single item true-false and multiple-choice types, it will 
be noted that the time per item is much higher than typical times 
reported for multiple-choice and true-false tests.*.’*!* The average 
pupil, in studies Ruch reviewed, answered five or six such items per 
minute. Whether the present test be conceived as a twenty-two-item 
or a one-hundred-and-ten-item test, the speed of response is markedly 
lower. To compare the multiple and single item forms with certainty, 
it would be necessary to experiment using questions of equal difficulty 
and similar subjects, with the factor of novelty held constant. The 
present data support a comment by Flanagan: 


In most of our test preparation we are principally concerned with the 
validity of a form of item per unit of time. It is my judgment, based on some 
experience with these types of items [multiple multiple-choice], that they are 
not superior to, and may be slightly inferior to the regular five-choice items in 
this respect. It is very difficult to generalize about such matters, however, 
since there are places in which such items are appropriate and probably useful.* 


Difficulty of Tests—The R-W formula was used for scoring each 
test. The mean score in Group A, with 55 possible, was 26.9, with 
a sigma of 7.6. The mean in Group B, with 110 possible, was 52.4,
with a sigma of 13.5. Each paper in Group B was also rescored dis- 
regarding all items marked false; this is equivalent to scoring the test 
on the multiple multiple-choice pattern. The mean score for Group B 
papers rescored on the Group A pattern was 26.3, with a sigma of 6.8. 
The difference between Group A and Group B in score is only one-half 
its standard deviation. On the criterion score described below the 
mean of Group A was 301.3; that of Group B was 292.5. This differ- 
ence is 1.1 times its standard deviation. These data imply that the 
form of test does not increase or decrease performance. The standard 
deviation of criterion score in Groups A and B, respectively, were 40.1 
and 43.4; the difference was reversed in the standard deviations of the 
experimental test, but the difference is not statistically significant. 

* In correspondence with the writer, May 6, 1941. “We” refers to the Coöperative
Test Service.

It may be noted that when no omissions occur, scores on the
multiple true-false pattern can be converted to scores on the multiple
multiple-choice pattern by the formula


Score A = (Score B + x − y) ÷ 2,


where A is the latter score and B is the former, x is the number of sub-
items keyed true, and y is the number keyed false. It follows that
when students make no omissions the two forms are equivalent. In
this experiment, with instructions to guess, omissions were made by
seven of the fifty-seven students in Group B; none of these made over
two omissions. It, therefore, appears that in this situation which
form of test was used made no practical difference. The multiple-
choice form does not indicate definitely whether a student is omitting
or considers a sub-item false; if any person were to omit frequently
this would be significant information. “Do not guess” instructions
would be expected to increase the importance of knowing whether a
response is an omission or a judgment of “false.”

Tendency to “Acquiescence.”—Under the multiple-choice pattern,
indicating only “true” judgments, the student might mark more items
than he would call “true” under the true-false pattern. Perhaps, on
the contrary, when required to mark every item, the student would
mark more “true” under the latter plan. The number of choices
indicated by students in Group A, and the number of “true” indica-
tions by those in Group B, were scored; this score will be called the
“acquiescence” score. The mean for Group A was 59.6, with a sigma
of 7.41; for Group B, 62.2, with a sigma of 5.04. The difference in means is 2.2 times
its standard deviation and supports the second hypothesis, that
marking every item causes the student to indicate some alternatives
as “correct” which would otherwise be left blank. The difference in
standard deviations is 2.9 times its standard deviation, and indicates 
that the true-false form may lead to less variation in acquiescence 
score from student to student. A comparison of the two forms using 
the same subjects is required to determine finally the effect of direction- 
set on tendency to acquiescence. Other comments on the acquies- 
cence factor appear below. 
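
The ratios of each difference to its standard deviation can be approximately reproduced from the reported figures. The following is a minimal sketch, assuming the usual large-sample standard errors for independent groups (for a mean, sigma divided by the square root of N; for a sigma, sigma divided by the square root of 2N). The article does not state which formulas were used, so these are assumptions, and the function names are illustrative.

```python
from math import sqrt

def critical_ratio_means(m1, s1, n1, m2, s2, n2):
    """Difference between two independent means over its standard error."""
    se_diff = sqrt(s1**2 / n1 + s2**2 / n2)
    return abs(m1 - m2) / se_diff

def critical_ratio_sigmas(s1, n1, s2, n2):
    """Difference between two independent sigmas over its standard error."""
    se_diff = sqrt(s1**2 / (2 * n1) + s2**2 / (2 * n2))
    return abs(s1 - s2) / se_diff

# Acquiescence scores: Group A (N = 60) and Group B (N = 57)
cr_mean = critical_ratio_means(59.6, 7.41, 60, 62.2, 5.04, 57)
cr_sigma = critical_ratio_sigmas(7.41, 60, 5.04, 57)
print(f"means: {cr_mean:.2f}   sigmas: {cr_sigma:.2f}")
```

With these assumed formulas the ratios round to 2.2 and 2.9, in agreement with the values quoted in the text.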

Reliability —The reliability of each test was obtained by correlating 
the score on odd items (not sub-items) with even. The Spearman- 
Brown formula was applied, giving reliabilities of 0.625 and 0.428 for 
Groups A and B, respectively. When the value for Group A was 
corrected for the difference in variabilities of the groups, it reduced to 
0.530. The difference is not statistically significant. 
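
The two corrections mentioned here can be sketched as follows. The Spearman-Brown step is the standard formula. The correction for variability is an assumption: it uses the common adjustment that supposes equal error variance in the two groups, with the sigmas of 7.6 (Group A) and 6.8 (Group B rescored on the Group A pattern) reported above. The article does not state which formula or which sigmas were actually used, and the function names are illustrative.

```python
def spearman_brown(r_half, factor=2.0):
    """Reliability of a test lengthened by `factor`, given the part correlation."""
    return factor * r_half / (1.0 + (factor - 1.0) * r_half)

def half_test_r(r_full):
    """Invert the two-fold Spearman-Brown formula to recover the odd-even r."""
    return r_full / (2.0 - r_full)

def correct_for_variability(r, sd_observed, sd_target):
    """Reliability expected in a group with sigma sd_target,
    assuming the error variance is the same in both groups."""
    return 1.0 - (1.0 - r) * (sd_observed / sd_target) ** 2

# Odd-even correlations implied by the stepped-up values 0.625 and 0.428
print(round(half_test_r(0.625), 3), round(half_test_r(0.428), 3))

# Group A reliability restated for a group as variable as Group B
print(round(correct_for_variability(0.625, 7.6, 6.8), 2))
```

Under these assumptions the corrected Group A value comes out at about 0.53, in line with the 0.530 reported.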

Validity.—As a criterion of validity, the student’s total perform- 
ance on all other tests during the semester was obtained by adding his 
T-scores on four one-hour examinations and the two-hour final. The 
correlation of score with criterion in Group A was 0.584; in Group B,
0.598. When the validity in Group A is corrected for the difference
of variability of the groups in criterion score, the coefficient becomes 
0.620; the difference between the validities of Tests A and B is not 
significant. 

Ease of Scoring.—No record was made of the speed and accuracy of 
scoring. It was the impression of the scorer that the A form was 
somewhat simpler to score, but that the difference was slight. 

Internal Analysis of Multiple True-false Test Form.—An internal 
investigation of Test B was made to determine the characteristics of 
items having few, and items having many choices keyed “true.” A 
similar analysis could have been made for Test A, but it is believed 
that such a study would agree with the data here presented. The 
test contained few items in which all sub-items were correct or incor- 
rect, and many in which two or three of the five choices were correct. 
The mean score of the group on each item, and the standard deviation, 
were computed. These statistics were averaged for items having the 
same number of “true” sub-items; results are shown in Table I. 

TABLE I [title truncated in source] . . . VARIANCES

Number of            Number of     Mean of mean       Mean of item
“true” sub-items       items      scores on items        sigmas
       5                 1             4.11               1.50
       4                 4             3.04               1.62
       3                 4             2.58               1.81
       2                10             2.15               2.12
       1                 2             1.48               2.36
       0                 1             1.21               2.29
   All items            22             2.38               1.97

It is apparent that the difficulty of items decreases as the number of
true sub-items increases, and that the items with a small number of
true choices have a comparatively high standard deviation and a 
higher effective weighting in the total test score than items with many 
true choices. The correlation between number of true choices and 
item means is +0.55; between number of true choices and item sigmas, 
−0.64. These conclusions cannot be extended to other tests, but it
appears that in this situation, at least, it was difficult to construct 
several true choices for any item without some being too obvious to 
the student. 


THE HYPOTHESIS OF ACQUIESCENCE


It has been pointed out several times that when the student is 
uncertain of the answer to a true-false item, he is likely to respond 
“true” rather than “false.’’!-+5.6.5.>-365-366 The data on “acquies- 
cence’’ presented above show that the mean number of items marked 
“true” by Group B was 62, although by the key fifty-five items were 
true and fifty-five false. It has been customary in the past to con-
sider this excess of “true for false” errors to be a general law of guess-
ing, although Weidemann has commented:


The temperament of examinees may cause some to respond more often 
true than false and others to respond more often false than true to the same 
item in a given true-false examination. . . . In the case of a large group of 
examinees temperament might be considered a chance factor, compensating in 
such a way that the tendency for some examinees to respond more often true 
than false would be offset by the tendency of others to respond more often 
false than true.® 


Fritz commented that his data, in which sixty-two per cent of 
guesses were “true,” suggested a “native tendency to believe” on 
the part of students, but did not remark on individual differences 
in this tendency.‘ It seems probable that when two students guess 
with the same frequency, one will tend more than the other to respond 
“true” more frequently than “false.” No guess is a completely 
random response; even the student without knowledge consciously 
reacts to the tone and general character of the statement. Whether 
he is willing to accept a statement that sounds authoritative or is 
suspicious of plausibility may be a personality trait or a mind-set 
that influences his performance on all tests of this sort. Lentz, working 
with personality tests, notes that there are individual differences in 
readiness to accept statements, marking them “Yes” or “True,” 
and that this factor, which he named “acquiescence,” is likely to 
distort the validity of scores.’ In analysing personality, attitude, 
and achievement tests, the writer has encountered data which agree 
with Lentz’ finding, and suggest that the personality trait may 
influence scores on any two-choice test, or test of the form, “Like-
Indifferent-Dislike,” or equivalent.

How the tendency to mark more items “true” than “false”
would influence a true-false test score can be explained with hypotheti- 
cal data. A more complete treatment based on probability theory 
yields the same results. If just half the items in a ten-item test are 
keyed true, the mean score of any student who guesses will most
likely be zero (scored R-W), no matter whether he marks three, five,
or seven of the items true. But, whereas for the student who marks
as many items “true” as “false,” guessing on all items, the score can
range from +10 to −10, for the acquiescent student who guesses
and marks seven items “true,” the possible range is +6 to −6. It
follows that when test items are equally divided between true and
false, acquiescence does not, on the average, raise or lower scores, but
does decrease the spread of scores. If, in constructing the test,
seven of the ten items had been keyed “true,” a different situation
would arise. The student guessing and marking half the items
“true” can receive scores from +6 to −6, with the most probable
score 0. The acquiescent student, marking seven of his guesses
“true,” can receive scores from +10 to −2, with the most probable
score +4. In other words, the tendency to be acquiescent will be
associated with a high score if more than half the items are keyed
“true,” and a low score if less than half are keyed “true.” This
suggests that validity will be increased by rigidly restricting the 
division of items to an exact fifty per cent “true,” fifty per cent 
“false,” in test construction; even so, acquiescence will affect the range 
of scores. Few students will guess throughout the test; in practice, 
guessing is greatest on the more unfamiliar items. It follows that 
one should equate the number of true and false items not only through- 
out the test, but at each level of difficulty. 
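
The hypothetical ten-item cases above can be enumerated exactly. The following is a minimal sketch, assuming a student with no knowledge who marks a fixed number of items “true” and places those marks at random, with R-W scoring; this model and the function names are assumptions made for illustration, not the writer's own probability treatment.

```python
from fractions import Fraction
from math import comb

def score_distribution(n_items, n_keyed_true, n_marked_true):
    """Return {R-W score: probability} for a pure guesser who marks exactly
    n_marked_true items true, chosen at random among n_items."""
    n_keyed_false = n_items - n_keyed_true
    total_ways = comb(n_items, n_marked_true)
    dist = {}
    for hits in range(n_marked_true + 1):
        false_alarms = n_marked_true - hits
        if hits > n_keyed_true or false_alarms > n_keyed_false:
            continue  # impossible split of the "true" marks
        right = hits + (n_keyed_false - false_alarms)
        wrong = n_items - right
        ways = comb(n_keyed_true, hits) * comb(n_keyed_false, false_alarms)
        dist[right - wrong] = Fraction(ways, total_ways)
    return dist

def summarize(dist):
    """Lowest score, highest score, and mean score of a distribution."""
    mean = sum(score * p for score, p in dist.items())
    return min(dist), max(dist), mean

for keyed, marked in [(5, 5), (5, 7), (7, 5), (7, 7)]:
    low, high, mean = summarize(score_distribution(10, keyed, marked))
    print(f"keyed true {keyed}, marked true {marked}: "
          f"range {low:+d} to {high:+d}, mean {float(mean):+.1f}")
```

The ranges agree with those given in the text. Under this particular model the mean score of the acquiescent student, when seven of ten items are keyed “true,” works out to +1.6, with +2 the single most probable score. That is lower than the midpoint of the range, though still above the zero expected of the student who marks half the items “true,” so the direction of the argument is unchanged.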

One would expect the acquiescence tendency to be most operative 
where students are least certain, therefore negatively correlated with 
knowledge. The correlation between number of items marked “true” 
and test score in Group A was −0.093; in Group B, −0.433. The
reliability of the “acquiescence” score was 0.242 in Group B. It is
not surprising that this value is low, since several factors other than
the acquiescence trait interact to give the final number of items marked
“true”; only where every response is a guess would acquiescence
alone be measured. The implication that the “total items marked
‘true’” score is not a valid measure of the hypothecated trait “acquies-
cence” may also explain why no relation was found between this
score and test validity. The validity of the test, with total trues 
held constant, was 0.580 in Group A and 0.550 in Group B; this is 
not substantially different from the zero order values of 0.584 and 
0.598, respectively. Further studies of the question of acquiescence 
as a trait affecting performance on such tests are required. 

The influence of high acquiescence, if this variable is a cause and
not a result, would be to increase a student’s score on items keyed
“true” and reduce it on items keyed “false.” This thought led to
dividing Test B into two new fifty-five-item tests, one containing all
the items keyed “true,” one, those keyed “false.” While this is a
simple procedure, no similar study has been encountered in the 
literature. Reliabilities for the tests were obtained by splitting, odd 
against even items (not sub-items); the Spearman-Brown formula 
was not applied. Validity was obtained by correlating the student’s 
score on each group of items with the criterion described above. 
Results are shown in Table II. Both the difference between the 
reliabilities and that between the validities of the true and false tests 
are statistically significant. The results from the analysis of these 
data are of course only representative of this particular set of items; 
the results are sufficiently striking to encourage considerable further 
investigation. 

TABLE II.—RELIABILITY AND VALIDITY OF A TEST OF FIFTY-FIVE ITEMS KEYED
“TRUE” AND A TEST OF FIFTY-FIVE ITEMS KEYED “FALSE”

Test                  Reliability         r with
                      (odd-even r)       criterion
True sub-items          −0.149             0.319
False sub-items         +0.373             0.700





DISCUSSION AND SUMMARY 


The comparison of the multiple multiple-choice and multiple 
true-false tests, with instructions to guess, disclosed little significant 
difference between them. Under these circumstances few omissions 
were made, which makes the tests virtually interchangeable. The 
multiple-choice type of test has slightly higher reliability and seems 
slightly easier to score. Under the true-false form the student may 
mark more items “true” than he would choose under multiple-choice
directions. Evidence from this study would support the use of the 
multiple-choice rather than multiple true-false form, if omissions were 
not expected. Further study using do-not-guess instructions is 
required. 

The evidence that the difficulty and spread of items are related to 
the number of true sub-items casts doubt on the previous suggestion 
that the number of true sub-items should vary from none to all in 
various items. Since items with few true alternatives have the great-
est effective weight, it may be wise to use few items having the majority
of sub-items true. Such a procedure would penalize the more acquies- 
cent student. It is possible that more careful construction of items 
with many true alternatives may increase their effective weight. 

The hypothesis that the trait, acquiescence, may affect one’s score 
should receive further study. It is clearly unwise to include personality
tendencies in an achievement score if validity is desired.
Statistical evidence in this study fails to demonstrate that acquiescence
affects test validity. If this factor should be found a genuine
influence, it may be removed by instructing students that the given 
test contains exactly fifty (or whatever other) per cent of items keyed 
“true,” and advising them to adjust their responses accordingly. 
Dunlap, de Mello, and Cureton have proposed directing the student 
not to guess, but after counting the number of “true” and “false” 
responses among the items he was certain of, to mark all remaining 
items with the symbol least used in those answers. Their experi- 
mental data showed these directions superior to instructions to guess, 
but slightly inferior to do-not-guess instructions.* It is possible that 
it is still better to advise the student to mark—not all uncertain items 
by one symbol—but enough of the items true and enough false to make 
the total frequency of “true” and “false” equal on his answer sheet.

The evidence that a test composed of false items is more reliable 
and valid than a test of equal length containing true items may be 
specific to this test. If further studies should show this true of most 
true-false tests, one of course cannot conclude that only false items 
should be used. But there may be justification for raising the per- 
centage of false items to sixty or a similar point, provided that the 
penalty this places on acquiescence is removed. A rational basis for 
greater validity in false items lies in the acquiescence hypothesis. 
If acquiescence affects principally the scores of those who guess 
(hence are incompletely informed), and if it acts to lower scores on false 
items while raising chances of success on true items, it follows that 
score on items keyed “false” will be more highly correlated with
certainty, i.e., knowledge.
SELECTION OF UPPER AND LOWER GROUPS FOR
ITEM VALIDATION 


G. FORLANO AND R. PINTNER 
Teachers College, Columbia University 


Among test constructors of today objective item validation plays 
an important part in the process of constructing a test. Long and 
Sandiford* have made a survey of previous studies in the field of item 
validation and have presented additional experimental evidence 
concerning the performance of various methods of item analysis. 
Pintner and Forlano‘ have reported the results of a comparison of 
nine methods of item validation. Their results seem to indicate that, 
other things being equal, there is little difference as to which method of 
item validation is used in the selection of the best items for a subtest 
in the construction of a personality test of the inventory type. 

In the studies to which we have referred, various methods were 
surveyed, experimented with, and discussed. Should Pintner and
Forlano’s report be confirmed by the results of future studies then 
one of the next steps in the field of item analysis would be to experi- 
ment with and analyze more fully one of these methods. Of these 
methods of item selection, probably the simplest and easiest one to 
apply is the Upper and Lower Halves Method. As far back as 1929, 
Ruch® used the Upper and Lower Halves Method in validating test 
items. Long and Sandiford reported, “that the Upper and Lower 
Halves, which uses all of the data at the disposal of the examiner, is 
not so good a method as the Upper and Lower Thirds or the Upper 
and Lower twenty-seven per cent.” The next question one may ask
is: Is the Upper and Lower twenty-seven per cent Method better
than the Upper and Lower Thirds Method? Should the former
technique produce better results from the point of view of increased
reliability, then an additional advantage of the Upper and Lower
twenty-seven per cent Method would be a saving in time and energy
to those who use it, since but fifty-four per cent of the cases need
be tabulated as against sixty-six per cent of the cases which must be
tabulated when using the Upper and Lower Thirds Method.

In 1933 Votaw*® reported, “that it was revealed by Jensen,' who
in turn gave credit for the development of the proof to Dr. T. L. Kelley,
that the size of the upper and lower categories should each be twenty-seven
per cent of the total number of students to produce a maximum
ratio between the difference of their means and the probable error of
the difference.” In a more recent article Kelley? states, “We, therefore,
conclude that if no distinction is made among the members of
our upper and lower groups separately, when studying the performance
of items, that we should, in general, select the twenty-seven per cent
highest on the criterion measure for the upper group and the twenty-seven
per cent lowest for the lower group.”
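
The twenty-seven per cent figure can be checked numerically. The sketch below is not Kelley's proof. It assumes a normal criterion with unit variance and, as a simplification, treats the error of the difference between the two tail means as proportional to 1/√p, where p is the fraction taken in each tail. Under those assumptions the ratio of the difference to its error is proportional to φ(z)/√p, and a simple grid search locates its maximum. The name `tail_ratio` is illustrative only.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # assumed criterion distribution: mean 0, SD 1


def tail_ratio(p: float) -> float:
    """Quantity proportional to (upper mean - lower mean) / error of the
    difference when a fraction p is taken from each tail."""
    z = STD_NORMAL.inv_cdf(1.0 - p)      # cut-off score for the upper tail
    tail_mean = STD_NORMAL.pdf(z) / p    # mean of the cases above the cut-off
    return tail_mean * sqrt(p)           # difference ~ 2*tail_mean, error ~ 1/sqrt(p)


# Search tail fractions from 1 to 50 per cent in steps of 0.1 per cent.
candidates = [k / 1000 for k in range(10, 501)]
best_p = max(candidates, key=tail_ratio)
print(f"tail fraction giving the largest ratio: {best_p:.3f}")
```

Under these assumptions the maximum falls close to 0.27; with a markedly non-normal criterion the optimum need not be the same.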

From an empirical or experimental point of view, does the Upper 
and Lower twenty-seven per cent Method give the optimum results? 
Other upper and lower portions of the distribution may be used in the 
selection or validation of items of a test; for example, the upper and 
lower sixteen per cent, and the upper and lower seven per cent. In 
this short paper, results will be presented on the evaluation of the 
following lower and upper groups of a distribution of criterion scores: 

(1) The Upper vs. Lower fifty per cent: Method I. 

(2) The Upper vs. Lower thirty-three and one-third per cent: 
Method II. 

(3) The Upper vs. Lower twenty-seven per cent: Method III. 

(4) The Upper vs. Lower sixteen per cent: Method IV. 

(5) The Upper vs. Lower seven per cent: Method V. 

It will be noted that in Method IV the dividing line starts at points
±1σ away from the mean; in Method V, the dividing line starts at
points ±1.5σ away from the mean. The criterion of evaluation of
these five methods may be stated as follows: To be judged the best, 
a method must select items which, when incorporated in a test, pro- 
duce the highest coefficient of internal consistency or reliability. 
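
The correspondence between the sixteen and seven per cent groups and the ±1σ and ±1.5σ dividing lines can be verified from the normal curve. The following is a minimal sketch that assumes normally distributed criterion scores.

```python
from statistics import NormalDist

std_normal = NormalDist()  # normality of the criterion scores is an assumption

for label, z in (("Method IV", 1.0), ("Method V", 1.5)):
    tail = std_normal.cdf(-z)  # fraction of cases beyond z SDs on one side
    print(f"{label}: cut at ±{z}σ -> {tail:.1%} in each extreme group")
```

The two fractions come out at about 15.9 and 6.7 per cent, which round to the sixteen and seven per cent named for Methods IV and V.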

Two sets of data will be presented. The first set is based on the 
test results of one hundred girls in grades V and VI of a public elementary
school. They were given the Study Habits Inventory containing
one hundred items. Each item was to be answered “Yes”
or “No” by the subject. The second set of data deals with results 
of the Home-Background Survey Test which was administered to 
one hundred boys in grades V to VIII of a public elementary school. 
This survey test contained one hundred ten items concerning an 
individual’s behavior towards his father, mother, and siblings. Each 
item was to be answered “Yes” or “No” by the subject.

The procedure followed in analyzing the Home-Background 
Survey Test and the Study Habits Inventory will be illustrated by 
using the latter test as an example. After scoring the Study Habits 
Inventory, the hundred total scores, or criterion scores, were arranged 
in a descending order from highest to lowest. Then beginning with 
the subject who made the highest score, a detailed summary was
made of yes or no responses to each one of the hundred items. 

An item analysis was made of each test by each of the five methods.
Each method was used to select first the best thirty-six items and 
then the best seventy-two items. For example, when the Upper 
and Lower 50 per cent Method was applied to each item, the number 
of rights in the lower fifty per cent of the distribution of criterion scores 
was subtracted from the number of rights in the upper fifty per cent 
of the distribution. The greater the difference in number of rights 
in favor of the upper fifty per cent, the better the item. These differ- 
ences were arranged in a descending order with the greatest plus 
differences at the top. The top thirty-six differences indicated the 
“best’”’ thirty-six items. The best seventy-two items were selected 
in the same manner. The general procedure employed in connection 
with Method I was used with the other four methods. 
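
The selection procedure just described can be sketched as follows. The function name `select_items`, its parameters, and the toy response matrix are hypothetical; the sketch assumes each response has already been scored 1 for a "right" (keyed) answer and 0 otherwise, and that the criterion score is the total number of rights.

```python
def select_items(responses, fraction, n_best):
    """Rank items by (rights in upper group) - (rights in lower group).

    responses: list of per-subject lists, 1 = keyed ("right") answer, 0 = otherwise.
    fraction:  share of subjects placed in each extreme group (0.50, 0.27, ...).
    n_best:    number of items to keep.
    """
    # Criterion score = total number of rights; arrange subjects high to low.
    ordered = sorted(responses, key=sum, reverse=True)
    group_size = round(len(ordered) * fraction)
    upper, lower = ordered[:group_size], ordered[-group_size:]

    n_items = len(responses[0])
    differences = [
        sum(s[i] for s in upper) - sum(s[i] for s in lower) for i in range(n_items)
    ]
    # Largest positive differences first; ties keep the original item order.
    ranked = sorted(range(n_items), key=lambda i: differences[i], reverse=True)
    return ranked[:n_best], differences


# Hypothetical data: 6 subjects, 4 items.
toy = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
best, diffs = select_items(toy, fraction=0.5, n_best=2)
print(best, diffs)  # [0, 2] [3, 1, 2, -1]
```

Changing `fraction` to 0.333, 0.27, 0.16, or 0.07 gives the counterparts of Methods II to V; only the size of the two extreme groups changes.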

It will be recalled that each method was employed to select items 
to form two experimental tests, one composed of the best thirty-six 
items and the other the best seventy-two items. The latter test 
consisted of the former’s thirty-six items plus the thirty-six next best 
items. 

After the five methods of item validation had been applied, we 
had in all ten experimental tests, each of which was of unknown 
reliability. Odd-even, that is, split-halves reliability coefficients 
were computed. The Spearman-Brown prophecy formula was then
used to find the reliability of the whole experimental test. The 
reliabilities of the 36-item experimental tests based on the Study 
Habits Inventory and Home-Background Survey will be presented 
first. Table I shows the split-halves reliability coefficients for each 
method. 
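
As an illustration of the reliability computation, the sketch below correlates odd-item and even-item half scores and steps the result up with the Spearman-Brown formula, r_full = 2r_half / (1 + r_half). The function names and the small response matrix are hypothetical; they are not the authors' data.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_brown(r_half):
    """Estimated reliability of the whole test from the half-test correlation."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(responses):
    """Odd-even correlation, stepped up to full length."""
    odd = [sum(s[0::2]) for s in responses]   # items 1, 3, 5, ...
    even = [sum(s[1::2]) for s in responses]  # items 2, 4, 6, ...
    return spearman_brown(pearson_r(odd, even))


# Hypothetical responses (1 = right) for three subjects on four items.
demo = [[1, 1, 1, 1], [1, 1, 0, 0], [0, 0, 0, 0]]
print(split_half_reliability(demo))
print(round(spearman_brown(0.5), 3))  # 0.667
```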

TABLE I.—RELIABILITIES OF 36-ITEM EXPERIMENTAL TESTS SELECTED BY THE
FIVE METHODS FROM THE STUDY HABITS INVENTORY AND THE
HOME-BACKGROUND SURVEY

                        | Method | Method | Method | Method | Method
                        |   I    |   II   |  III   |   IV   |   V
Study Habits Inventory  |  .799  |  .842  |  .752  |  .815  |  .783
Home-Background Survey  |  .734  |  .718  |  .826  |  .767  |  .831





An examination of Table I reveals that for the Study Habits 
Inventory Method II, the upper and lower thirty-three and one-third 
per cent, gives the highest reliability coefficient. Method III, the
upper and lower twenty-seven per cent, gives the lowest reliability
coefficient. The difference in r’s between Method
II and Method III is but three times its probable error and is, therefore,
not statistically significant. For the home-background material,
Method V gives the highest reliability coefficient; Method III gives 
the next highest, namely, .826. In general, the results presented in 
Table I indicate that no one method always occupies first rank and 
that the methods employing the least number of criterion scores,
namely, Methods IV and V, seem to give results almost comparable 
to those for Methods I, II, and III. It is interesting to note that 
for the Home-Background Survey, Methods III and V give higher 
reliability coefficients, whereas for the Study Habits Inventory, the 
situation is reversed. This observation led us to consider the necessity 
for increasing the number of items in each experimental test in the 
interest of increased stability or reliability. 

Table II shows the results for the lengthened experimental tests,
that is, for the 72-item tests.

TABLE II.—RELIABILITIES OF THE 72-ITEM EXPERIMENTAL TESTS SELECTED BY
THE FIVE METHODS FROM THE STUDY HABITS INVENTORY AND THE
HOME-BACKGROUND SURVEY

                        | Method | Method | Method | Method | Method
                        |   I    |   II   |  III   |   IV   |   V
Study Habits Inventory  |  .821  |  .828  |  .823  |  .819  |  .720
Home-Background Survey  |  .816  |  .835  |  .858  |  .776  |  .804

This table shows that for the Study
Habits Inventory, Methods I, II, III, and IV produce tests whose
reliability coefficients are practically identical. Method III is second 
best and Method II occupies first rank, but the difference between 
the two is not a significant one. With the Home-Background Survey, 
Method III is first with a reliability coefficient of .858 and Method II
is second best. On both Study Habits Inventory and Home-Back- 
ground Survey, Method I maintains the same ranking; namely, third. 
With the study habits material, Method IV is fourth and Method V 
ranks fifth, but with the Home-Background material these two 
methods exchange rankings. 

At this point let us compare the rankings of the various methods 
on the 36-item experimental tests with the rankings of the same 
methods on the 72-item experimental tests. Table III shows this
comparison of rankings.

TABLE III.—A COMPARISON OF RANKINGS OF RELIABILITIES OF THE FIVE METHODS
ON THE 36-ITEM AND 72-ITEM EXPERIMENTAL TESTS

       |        36-item Tests         |        72-item Tests
Method | Study habits, | Home back-   | Study habits, | Home back-
       | rank          | ground, rank | rank          | ground, rank
I      |       3       |      4       |       3       |      3
II     |       1       |      5       |       1       |      2
III    |       5       |      2       |       2       |      1
IV     |       2       |      3       |       4       |      5
V      |       4       |      1       |       5       |      4





The shifts up or down are greater for the 36-item tests than for 
the 72-item tests. This detailed discussion of the results of Tables II 
and III seems to indicate that the lengthened tests did help to stabilize 
the results. It was felt that the results of Table II gave a fairer 
picture than those of Table I. If the latter view is a reasonable one 
to take, we may conclude that Methods II or III tend to give the 
better results, that is, the higher reliability coefficients. 
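
The rankings of Table III follow directly from the coefficients of Tables I and II, and the size of the "shifts" can be summarized as the sum of the absolute rank differences between the two inventories. The sketch below recomputes both; the summary index is an illustrative choice, not a statistic used by the authors.

```python
# Reliability coefficients copied from Tables I and II (Methods I to V).
reliabilities = {
    "36-item": {
        "Study Habits": [0.799, 0.842, 0.752, 0.815, 0.783],
        "Home-Background": [0.734, 0.718, 0.826, 0.767, 0.831],
    },
    "72-item": {
        "Study Habits": [0.821, 0.828, 0.823, 0.819, 0.720],
        "Home-Background": [0.816, 0.835, 0.858, 0.776, 0.804],
    },
}
METHODS = ["I", "II", "III", "IV", "V"]


def ranks(values):
    """Rank 1 = highest coefficient (no ties occur in these tables)."""
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]


shift = {}
for length, tests in reliabilities.items():
    study = ranks(tests["Study Habits"])
    home = ranks(tests["Home-Background"])
    # Total disagreement in rank between the two inventories.
    shift[length] = sum(abs(a - b) for a, b in zip(study, home))
    print(length, dict(zip(METHODS, zip(study, home))), "total shift:", shift[length])
```

The total shift is 12 rank places for the 36-item tests and 4 for the 72-item tests, which is the stabilizing effect described in the text.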

Let us refer once more to the results presented in Table II. After 
an examination of the reliability coefficients for Methods I, II, and III,
one may argue that there is little difference between the reliabilities
for Method III and any of the reliabilities for Methods I and II.
Although this is so, there remains the fact that Method III could be 
applied more easily and with less time. On the other hand, protag- 
onists for Methods IV and V may claim that the latter two methods 
are easiest of the five to apply and at the same time give reliabilities 
not statistically significantly different from those for Methods I, II, 
and III. But, the data of Table II show that the reliabilities for 
Methods IV and V are consistently smaller than those for Method III.
Therefore, Method III may be considered to be best from the point of 
view of high reliability and economy of time. 

In a recent study to which we have referred above, Kelley’s proof 
for the selection of the upper and lower twenty-seven per cent method 
presupposes that the distribution of criterion scores is a normal one. 
A check was made of the normality of the distribution of our criterion 
scores. The Chi-square test for goodness of fit was applied to the 
distribution of criterion scores for the Study Habits Inventory and
for the Home-Background Survey. The value of P for the Home- 
Background Survey was found to be less than one per cent whereas 
the value of P for the Study Habits Inventory was found to be twenty-two
per cent. With the former test the hypothesis of normality is
rejected; with the latter test we do not have adequate basis for rejecting
the hypothesis of normality. It will be noted, therefore, that in this 
short study we have been working with one distribution of criterion 
scores which was not normal and with one which was probably normal. 
It may be of interest to note that with the Home-Background Survey, 
which gave a non-normal distribution of criterion scores, Method III 
in Tables I and II ranks second and first, respectively. 
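
A check of this kind can be sketched as below. The grouping into eight equally probable classes, the function names, and both sets of scores are assumptions made for illustration; the source does not state how its class intervals were formed. The upper-tail probability is computed from the closed-form series for integer degrees of freedom so that the example needs only the standard library.

```python
import math
from statistics import NormalDist, mean, stdev


def chi2_sf(x, df):
    """Upper-tail probability of chi-square for integer df (closed-form series)."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        term = total = math.exp(-half)
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return total
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-half)
    for k in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * k + 1)
    return total


def normality_chi_square(scores, n_bins=8):
    """Chi-square goodness of fit to a normal curve fitted to the scores."""
    fitted = NormalDist(mean(scores), stdev(scores))
    observed = [0] * n_bins
    for s in scores:
        # Classes are equally probable under the fitted curve.
        observed[min(int(fitted.cdf(s) * n_bins), n_bins - 1)] += 1
    expected = len(scores) / n_bins
    chi2 = sum((o - expected) ** 2 / expected for o in observed)
    df = n_bins - 3  # classes - 1, less two fitted parameters (mean and SD)
    return chi2, df, chi2_sf(chi2, df)


# Hypothetical criterion scores: one roughly normal set, one strongly skewed set.
symmetric = [NormalDist(50, 10).inv_cdf((i + 0.5) / 100) for i in range(100)]
skewed = [-math.log(1 - (i + 0.5) / 100) for i in range(100)]
chi2_sym, df_sym, p_sym = normality_chi_square(symmetric)
chi2_skew, df_skew, p_skew = normality_chi_square(skewed)
print(f"symmetric: chi2={chi2_sym:.2f}, df={df_sym}, P={p_sym:.3f}")
print(f"skewed:    chi2={chi2_skew:.2f}, df={df_skew}, P={p_skew:.6f}")
```

A small P, as for the skewed set, leads to rejection of the hypothesis of normality; a large P gives no basis for rejecting it.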

As has been noted by Kelley* modifications of the upper and lower 
twenty-seven per cent method due to non-normal distributions, while 
theoretically desirable, are hardly feasible practically. From a 
practical and empirical viewpoint, it is concluded that for a simple 
and rapid, rough and ready method of item validation of test items 
of the inventory type, the Upper and Lower twenty-seven per cent 
Method is to be preferred, even though the distributions are more or 
less non-normal. 
THE RELATION OF MALLER CASE INVENTORY
SCORES TO INSTITUTIONAL ADJUSTMENT OF 
DELINQUENT BOYS 
DALE B. HARRIS 
Institute of Child Welfare, University of Minnesota 
The widespread use of personality tests in schools and clinics is an 
accepted feature of modern guidance and personnel work. To be sure, 
most clinicians use such instruments with more or less reservation. 
Frequently the reservation is strongest when the test diagnosis is quite 
contrary to the clinician’s hunch, or to the opinion which has acted as 
catalyst in his diagnosis. When opinion and test diverge widely, few 
are the psychologists who consistently accept the discrepancy as 
signifying the need for more data; those with a “test” bias tend to
reject opinion, and those with an “insight” bias reject tests.

Generally speaking, moderate advocates of personality tests try 
whenever possible to validate their instruments against some life 
criterion, such as achievement in concrete situations, performance 
under known conditions of stimulation, or reaction to defined problem 
situations. When the instrument consistently checks with the most 
carefully collected life-evidence, psychologists may then accept the 
paper-and-pencil measure as a convenient, time-saving, and valid 
short-cut in the descriptive or diagnostic process. This paper reports 
the relation of scores on a personality test to rated life adjustment in a
selected group of subjects. 

One of the better known measures of problem tendencies in children 
has resulted from years of careful work by J. B. Maller. The Case 
Inventory? contains four subtests which yield a total score, a high 
value of which is taken to indicate good personality adjustment. The 
author’s manual indicates that problem children in public schools make 
lower mean scores than do non-problem cases. (Table I.) 

As part of the routine psychological examination program at 
entrance to a Midwest state institution for delinquent boys, the Maller 
Inventory was given to one hundred ninety successive entrants. 
Ninety-seven took Form B, and ninety-three Form A. Maller reports 
norms for the test which indicate that the forms are practically 
equivalent; a group of seventy-two well-adjusted pupils obtained a 
mean score of 115.1 on Form A, and 113.5 on Form B. The boys 
taking Form B in the present study achieved a mean of 105.27, with a
standard deviation of 13.37, while those taking Form A made a mean
score of 111.02, with a standard deviation of 15.67. The difference
between these means according to Fisher’s formula' yields
a t value of 2.59, which for one hundred eighty-eight degrees of freedom
is significantly greater than zero; that is, the P value is less than .01.
The two samples are not then presumed to come from a common
population. The difference between the standard deviations corresponds
to a z value of .2191, which does not reach the .05 level in
Fisher’s table of z and is, therefore, presumed to be insignificant.

TABLE I.—COMPARING MEAN TEST SCORES AND STANDARD DEVIATIONS OF TWO
DELINQUENT GROUPS WITH CERTAIN VALUES REPORTED IN THE CASE
INVENTORY MANUAL

Group                                            |  N  | Form |  Mean  |  SD
Delinquent boys                                  |  93 |  A   | 111.02 | 15.67
Delinquent boys                                  |  97 |  B   | 105.27 | 13.27
Well-adjusted pupils (Maller’s norms)            |  72 |  A   | 115.1  | 11.0
Well-adjusted pupils (Maller’s norms)            |  72 |  B   | 113.5  | 11.4
Problem boys (14 and 15 years) (Maller’s norms)  | 145 |      |        | 15.0

The performance of delinquent boys indicates a poorer adjustment 
than that of unselected cases (reported in Maller’s manual). While 
not large, the direction of the difference is in agreement with that 
indicated by Maller. 

If the Maller Inventory were to give some indication of a boy’s 
likely adjustment in the institution, even though the estimate were 
only a little better than chance, it would be of considerable value to the 
staff of the institution. At present there are no known sources for 
prediction of behavior adjustment in an institution other than a sample 
of that behavior itself. To test the ability of this inventory to predict 
behavior adjustment at the institution, the following procedure was 
carried out:

The names of the one hundred ninety boys for whom scores were 
available were placed on cards, one name to a card. The cards were 
shuffled, and the pile submitted independently to six different raters— 
staff members who had had close contact with the boys for at least ten 
months in the institution. The raters classified the boys according to 
the instructions given below: 

1. There are a number of boys listed on the accompanying slips, one name
to each slip. Will you sort the slips into five groups, according to the
following scheme:

Group 1. Boys who in your opinion were very serious behavior problems
in this institution.

Group 2. Boys who in your opinion were to some extent behavior
problems in this institution.

Group 3. Boys who in your opinion were more or less neutral or
“colorless” in general behavior (not characterized particularly either
by good or by poor adjustment).

Group 4. Boys who in your opinion got along well.

Group 5. Boys who in your opinion got along exceptionally well.

2. Read over the above scale carefully and get it thoroughly in mind before
going ahead with the sorting. If you do not know a boy well enough to make
a judgment, place the slip in the envelope marked [illegible].

3. Place each group in the appropriately numbered envelope (i.e., pile
number 1 in envelope 1, pile 2 in envelope 2, etc.).


The median value of the several ratings received by each boy was 
computed. The subjects were then grouped into five subgroups or 
classes according to the median rating received, as is shown in Table II. 
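
The computation of the median rating can be illustrated as follows. The names and ratings are hypothetical; the sketch assumes that a rater who did not know a boy simply contributes no rating for him, which is why some medians rest on fewer than six judgments and why half-point medians such as 3.5 occur.

```python
from collections import Counter
from statistics import median

# Hypothetical ratings: one list per boy, one entry per rater who knew him.
ratings = {
    "boy_01": [1, 2, 1, 1, 2, 1],
    "boy_02": [3, 3, 4, 3, 3],      # one rater did not know this boy
    "boy_03": [4, 5, 4, 5, 5, 4],
    "boy_04": [3, 4, 3, 4],         # two raters did not know this boy
}

median_rating = {boy: median(r) for boy, r in ratings.items()}
distribution = Counter(median_rating.values())

for boy, value in median_rating.items():
    print(boy, value)
print(dict(distribution))
```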


TABLE II.—DISTRIBUTIONS OF MEDIAN INSTITUTIONAL ADJUSTMENT RATINGS OF
DELINQUENT SUBJECTS

Subgroup | Median ratings | Number of cases

Boys Taking Form A of Maller
1        | 1.0, 1.5       |   5
2        | 2.0, 2.5       |  14
3        | 3.0            |  32
4        | 3.5, 4.0       |  31
5        | 4.5, 5.0       |  11
Total    |                |  93

Boys Taking Form B of Maller
1        | 1.0, 1.5       |   3
2        | 2.0, 2.5       |  16
3        | 3.0, 3.5       |  38
4        | 4.0            |   7
5        | 4.5, 5.0       |  13
Total    |                |  97





The analysis of variance technique was applied to the several
classes who had taken one or the other form of the test. Table III
indicates the results for Form A of the Maller Inventory, and Table IV
gives the results of this analysis for Form B.

TABLE III.—ANALYSIS OF VARIANCE, SCORES ON FORM A OF MALLER INVENTORY

Source of variation | Sum of squares | df | Mean square
Total               |  1,149,288.00  | 92 |
Between groups      |     65,782.11  |  4 | 16,445.527
Within groups       |  1,083,505.89  | 88 | 12,312.567

TABLE IV.—ANALYSIS OF VARIANCE, SCORES ON FORM B OF MALLER INVENTORY

Source of variation | Sum of squares | df | Mean square
Total               |  1,106,338.00  | 96 |
Between groups      |     62,944.24  |  4 | 15,736.06
Within groups       |  1,043,393.76  | 92 | 11,341.24





The F ratio of 1.336 as computed from the mean square values given 
in Table III has a probability value of P > .05.* This value fails to 
refute the hypothesis that there are no significant differences among 
the means of the several classes. Had the P value fallen beyond the
.01 level, one could have concluded that groups selected by
raters differed significantly with respect to mean adjustment test scores. 
As it is, classes adjudged to be quite different by a life criterion 
(knowledge of supervisors) are not proved to differ significantly in test 
performance. 

A similar result is obtained for those taking Form B of the Inven- 
tory. The mean square values in Table IV result in an F ratio of 
1.388, for which the probability value is P > .05. 
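
The two F ratios can be recomputed from the sums of squares and degrees of freedom given in Tables III and IV. The function name `f_ratio` is illustrative.

```python
def f_ratio(ss_between, df_between, ss_within, df_within):
    """F = mean square between groups / mean square within groups."""
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return ms_between / ms_within


form_a = f_ratio(65_782.11, 4, 1_083_505.89, 88)   # Table III
form_b = f_ratio(62_944.24, 4, 1_043_393.76, 92)   # Table IV
print(round(form_a, 3), round(form_b, 3))           # 1.336 1.388
```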

One may object that no reliability has been determined for the 
ratings, and this is of course a just criticism. The use of six ratings, 
however, gives a more stable determination than would only two or 
three. The fact that six raters agreed in giving extreme values to a 
number of the cases, as shown in Table II, clearly indicates that there 
was inter-rater consistency—not all values regressed to the mean 
value, which would have been the case had there been no correlation 
among raters. Taking only staff members who knew the boys well 
and were in almost daily contact with them for at least ten months 
insured valid ratings. It is possible that the rating scale itself does 
not refer to an observable dimension of behavior. The raters, however,
when asked if their concept of the task was clear, all answered
in the affirmative, agreed to give the matter careful attention, and 
concluded that their groupings indicated significant adjustment differ- 
ences among the subjects they rated. 

One must recognize, of course, that in dealing with delinquents one 
is restricted largely to one extreme of a behavior continuum. Con- 
sequently, the study reported here is an exceedingly rigorous test of 
the instrument’s sensitivity. However, the results in this case are 
clear. Whatever traits of personality adjustment the Maller Inven- 
tory may measure, it does not appear to measure those traits which are 
related to successful behavior adjustment in the state school of this 
study. 
ATTITUDES OF DELINQUENT AND NON-DELINQUENT
GIRLS TOWARD SUNDAY OBSERVANCE, THE
BIBLE, AND WAR


WARREN C. MIDDLETON 
AND 
PAUL J. FAY 


DePauw University 


INTRODUCTION 


This study is an attempt to compare certain attitudes expressed 
by a group of institutionalized girls with those expressed by a group 
of high-school girls, where both groups are approximately equated with 
respect to such factors as age, intelligence, and educational status. 

Three of the Thurstone, et al., scales for the measurement of social 
attitudes (Form A) were used: Attitude toward Sunday observance 
(Wang); Attitude toward the Bible (Chave); and Attitude toward 
war (Peterson). A favorable attitude on each of these scales is 
indicated by a high score. 

The scales were administered to eighty-three girls in the Indiana 
Girls’ School and to one hundred two girls in the Greencastle, Indiana, 
High School. All subjects in both institutions were in the eighth, 
ninth, or tenth grades. In order to secure pertinent information 
about them, a personal data sheet, consisting of fifteen items, was 
used. The clinical files of the Indiana Girls’ School contained Stan- 
ford-Binet IQ’s and a record of the court charge under which each 
delinquent had been committed. Terman Group Test of Mental 
Ability scores were available for all the high-school pupils, and Stan- 
ford-Binet IQ’s were secured for all but a few. 

The non-delinquent girls were found to have a mean IQ between 
four and five points higher than the delinquent girls; both means 
fall well within the range which would be classified as normal. The 
chronological age range is from fourteen to nineteen years, although 
more than fifty per cent of each group are sixteen or seventeen years 
of age. Less than fifteen per cent of both groups are in the tenth 
grade; this comparative study is dealing, therefore, chiefly with pupils
in the eighth and ninth grades. 

Eighty-six per cent of the delinquent girls’ fathers, and ninety-two 
per cent of the non-delinquent girls’ fathers, are living; eighty per cent 
of the delinquent girls’ mothers, and ninety-one per cent of the non- 
delinquent girls’ mothers, are living. Approximately eight per cent 
of the delinquents’ fathers are unemployed, as compared with nine 
per cent of the non-delinquents’ fathers. Twenty-nine per cent of the 
delinquents’ parents are reported to own their homes, while forty- 
nine per cent of the non-delinquents’ parents are home owners. 

There is, as one might expect, a rather marked higher percentage 
of broken homes among the delinquent group. Twenty-four per cent 
of their parents are divorced, as compared with only about five per 
cent of the non-delinquents’ parents. Approximately thirty per 
cent of the delinquents report that their parents are separated, 
though not divorced; only four per cent of the non-delinquents indicate 
that their parents are separated. The fathers and mothers of the 
non-delinquent girls are, as a group, better educated than those of the 
institutionalized subjects. Only five per cent of the fathers and one 
per cent of the mothers of the delinquents are college graduates, as 
compared with eleven per cent of the fathers and fourteen per cent 
of the mothers of the high-school pupils. 

Only about thirteen per cent of the delinquents come from rural 
areas, while twenty-one per cent of the non-delinquent group are 
from the country. The largest proportion of delinquents comes 
from communities having a population ranging from two thousand 
five hundred to ten thousand. Approximately eighteen per cent of the 
fathers of both groups are reported to have seen service in World 
War I. The subjects indicate that they had few brothers in the war, 
but between forty and forty-five per cent of both groups had uncles 
who participated. Less than one per cent in each group say that they 
had near relatives who were killed or wounded. 

Sixty-seven per cent of the delinquents report that they are church 
members, while only forty-nine per cent of the non-delinquents say 
that they have any church affiliation. Eighty-seven per cent of the 
former group indicate that they own a Bible or a New Testament, 
as against ninety-two per cent of the latter group. According to 
the reports of the subjects themselves (assuredly not perfectly reliable), 
the institutionalized girls read the Bible more frequently than the 
high-school group. Fifteen per cent of the delinquents report that 
they have been Girl Scouts, while forty-nine per cent of the non-delinquents
have been Scouts.


RESULTS AND DISCUSSION 


The medians, means and standard deviations of the scores of the 
two groups of girls on the three attitudes scales (Sunday observance, 
Bible, and war) are shown in Table I. The differences between the
median scores, and the reliability of these differences, are also indicated.
It will be noted that only one of the median differences
(Sunday observance) is in excess of 1.00, a unit of measurement on
the scale.

TABLE I.—DIFFERENCES BETWEEN DELINQUENT AND NON-DELINQUENT GIRLS
WITH RESPECT TO THREE ATTITUDES

1 The difference between the median scores of the two groups is printed in the
row of the group which has the higher median.

On the Sunday observance scale the scores of the delinquents 
range from 8.6 to 3.2, with a mode of 6.75; the scores of the non- 
delinquents range from 8.4 to 2.8, with a mode of 3.75. The standard 
deviations are comparable: 1.13 for the delinquents and 1.45 for the 
high-school pupils. The median score of the delinquents is 1.42 
more favorable in attitude toward Sunday observance than is the 
median score of the non-delinquents. The obtained difference is 
very reliable (D/σD = 5.92).

On the Bible scale the scores of the delinquents range from 10.0 
to 5.8; the scores of the high-school girls range from 10.2 to 5.1. The 
two groups have the same mode. The standard deviations show 
some variation: .76 for the delinquents and 1.15 for the non-delin- 
quents. The median score of the delinquents is .24 more favorable 
toward the Bible than is the median score of the high-school group. 
The obtained difference is only fairly reliable (D/σD = 1.33).

On the war scale the scores of the institutionalized group range
from 7.2 to 2.3; the scores of the non-delinquent group range from 7.8 
to 2.1. The mode is the same for both groups. There is practically 
no difference in the standard deviations: 1.15 for the delinquents and 
1.18 for the high-school pupils. The median score of the non-delin- 
quents is .07 more favorable in attitude toward war than is the median 
score of the institutionalized subjects.¹ However, the obtained
difference is not reliable (D/σD = .33).


1 A group of one hundred thirty-nine eighth, ninth and tenth grade delinquent 
boys in the Indiana Boys’ School were administered the same attitude toward war 
scale about ten days after the delinquent girls were tested. A report of the results 
of this investigation may be found elsewhere [Cf. Middleton, W. C., and Fay, P. J.
“A comparison of delinquent and non-delinquent boys with respect to certain
attitudes.” J. Soc. Psychol. (in press)]. By way of comparison it is interesting
to note that the median score of the boys is .26 more favorable toward war than 
is the median score of the girls. The critical ratio is 1.40. 


A NOTE ON THE COMPUTATION OF Y VALUES FOR 
INTEGRAL VALUES OF X, WHEN Y IS A LINEAR 
FUNCTION OF X 
JOHN M. STALNAKER 
College Entrance Examination Board and Princeton University 


The increasing use of Hollerith machines and punched-card
methods for statistical and research work warrants the description of 
a method for the rapid and accurate determination of converted scores 
for a large number of raw scores. The procedure here outlined applies 
equally well to any situation where it is necessary to transmute one set 
of numbers to another set, providing that (1) the two sets have a linear 
relationship, (2) the one variable changes in integral or unit steps. 

Suppose that an examination has been prepared in American 
History and given to a large population. The first step is the com- 
putation of the mean and standard deviation of the raw scores. This 
can be accomplished by punching on a Hollerith card the individual’s 
name or some identifying number, and his raw score. The speed 
with which this is done depends upon the form in which the data are 
available. These cards, one for each candidate, are then verified 
and, through the use of the progressive totalling method on a tabulating
or accounting machine, fully described elsewhere,¹ the values of n,
ΣX, and ΣX² are rapidly obtained. From these constants, one can
obtain the mean and standard deviation. Suppose that the mean and 
standard deviation of the raw scores are found to be, respectively, 
82.45 and 21.34. The problem is to transmute these raw scores for 
each person of the entire large group into converted scores which have 
a pre-determined mean of 500 and a standard deviation of 100. If 
Y is the converted score and X the raw score, we know that: 

$$\frac{Y - 500}{100} = \frac{X - 82.45}{21.34}, \quad \text{or} \quad Y = 4.69X + 113.64$$
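The arithmetic behind these constants may be sketched as follows. The function names are hypothetical, and the division by n (rather than n - 1) in the standard deviation is an assumption, since the note does not say which form was used.

```python
import math

def moments_from_sums(n, sum_x, sum_x2):
    """Mean and standard deviation from n, sum of X, and sum of X squared.

    Assumption: the variance is divided by n, not n - 1.
    """
    mean = sum_x / n
    variance = sum_x2 / n - mean ** 2
    return mean, math.sqrt(variance)

def linear_constants(mean, sd, target_mean=500.0, target_sd=100.0):
    """Slope a and constant b of Y = a*X + b for the desired mean and S.D."""
    a = target_sd / sd
    b = target_mean - a * mean
    return a, b

# Values quoted in the text: raw mean 82.45, raw standard deviation 21.34.
a, b = linear_constants(82.45, 21.34)
print(round(a, 2), round(b, 2))  # 4.69 113.64
```

The constant 113.64 results from the unrounded slope; the slope is rounded to 4.69 only when it is entered on the machine.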


1 The Mendenhall-Warren-Hollerith Correlation Method, Columbia University
Statistical Bureau (Document No. 1), 1929, pp. 49.

A card is now punched with an “X” in, say, column 80 and 114.14
(the constant b, 113.64, plus .5) in a specified field. The tabulator is then so
wired, through the use of a digit selector, that when any card not
containing an X in column 80 passes through the machine a constant,
4.69, will add into the same counter in which has already been added
the constant 114.14. The machine is set to “break control” for
every card. Another counter is used to add a one for every card
without an “X” in column 80. A summary punch is attached to the
tabulator to punch the information in the counters on a set of
“summary” cards. The counters are arranged so that they do not clear
so long as cards remain in the machine. The name of the test or any
other desired information may be gang-punched or “duplicated”
into the summary cards, according to the type of summary punch
used. When the machines are so set, the single punched card
containing the constant 114.14 and an “X” in column 80 is placed in
the tabulator and followed by a stack of completely blank cards.
The machines are started and summary conversion cards are
automatically punched at the rate of about nineteen cards per minute.
The first several cards might contain the following information:


American History    0    114
American History    1    118
American History    2    123
American History    3    128


The first column of numbers indicates the integral values of X, and 
the second the corresponding values of Y, correct to the nearest 
integer. These master conversion cards will continue to be punched 
automatically until no more blank cards are placed in the feed hopper 
of the tabulator. 
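The progressive accumulation just described may be imitated in a few lines. This is a sketch of the arithmetic only, not of the machine wiring; the function name is hypothetical, and the counter is held in hundredths (an assumption made here to keep the sums exact).

```python
def conversion_cards(slope_hundredths, start_hundredths, n_cards):
    """Return (X, Y) pairs as the counters would accumulate them.

    start_hundredths is the constant b plus .5 (114.14 -> 11414);
    slope_hundredths is the amount added per blank card (4.69 -> 469).
    """
    counter = start_hundredths
    cards = []
    for x in range(n_cards):
        # The whole-number part of the counter is Y to the nearest
        # integer, because .5 was included in the starting constant.
        cards.append((x, counter // 100))
        counter += slope_hundredths  # the next blank card adds the slope
    return cards

print(conversion_cards(469, 11414, 4))
# [(0, 114), (1, 118), (2, 123), (3, 128)]
```

Adding .5 once at the start and then dropping the decimals is what yields the nearest integer without a separate rounding step.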

The conversion cards may then be listed on the tabulator, and as a 
check every tenth listing verified by a calculating machine. These
cards are then collated with the series of raw score cards, already
arranged in order of score. The cards are then placed in a gang-punch 
and the converted scores are punched, at the rate of one hundred cards 
a minute, into the cards containing the names of the candidates. 
The punching is electrically verified, a procedure which can be done 
at the same time and at the same rate of speed as the gang-punching. 
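In present-day terms, the collating and gang-punching step amounts to a table look-up on the raw score. The sketch below assumes a hypothetical card layout of (candidate, raw score) pairs; it is offered as an illustration rather than as a description of the machine operation.

```python
def gang_punch(raw_cards, conversion_cards):
    """Attach a converted score to every candidate card.

    raw_cards: (candidate, raw_score) pairs (hypothetical layout).
    conversion_cards: (raw_score, converted_score) pairs.
    """
    table = dict(conversion_cards)
    # The raw score cards are arranged in order of score before collation.
    ordered = sorted(raw_cards, key=lambda card: card[1])
    return [(name, raw, table[raw]) for name, raw in ordered]

conversion = [(0, 114), (1, 118), (2, 123), (3, 128)]
candidates = [("A", 2), ("B", 0), ("C", 2)]
print(gang_punch(candidates, conversion))
# [('B', 0, 114), ('A', 2, 123), ('C', 2, 123)]
```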

An additional check can be made by running these same cards, 
in the same order, through a properly wired tabulator so that a control 
break occurs for any change in either converted or raw score. A 
frequency distribution can thus be prepared, inspection of which will 
indicate whether or not any control breaks have occurred at any 
point other than where the raw score changes. 
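The logic of this check may be sketched as follows, assuming the cards are represented as (raw, converted) pairs already in order of raw score; the function names are hypothetical.

```python
from itertools import groupby

def control_break_counts(cards):
    """Frequency distribution with a break at any change in either score.

    cards: (raw, converted) pairs, already in order of raw score.
    """
    return [(key, sum(1 for _ in group)) for key, group in groupby(cards)]

def breaks_only_on_raw_change(cards):
    """True if no control break occurs while the raw score is unchanged."""
    raws = [raw for (raw, _), _ in control_break_counts(cards)]
    # A raw score listed twice means the converted score changed within it.
    return len(raws) == len(set(raws))

good = [(0, 114), (0, 114), (1, 118)]
bad = [(0, 114), (0, 115), (1, 118)]
print(control_break_counts(good))       # [((0, 114), 2), ((1, 118), 1)]
print(breaks_only_on_raw_change(good))  # True
print(breaks_only_on_raw_change(bad))   # False
```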

This method allows for speed, and at the same time gives results 
carefully and fully controlled for error. It demands a minimum of 
hand labor. The procedure has been tried on large-scale tasks and 
has worked satisfactorily. 